A look at aerosol formation using data mining techniques
ACPD
5, 7577–7611, 2005
A look at aerosol
formation using data
mining techniques
S. Hyvönen and
ATMDM team
Atmos. Chem. Phys. Discuss., 5, 7577–7611, 2005
www.atmos-chem-phys.org/acpd/5/7577/
SRef-ID: 1680-7375/acpd/2005-5-7577
European Geosciences Union
Atmospheric
Chemistry
and Physics
Discussions
A look at aerosol formation using data
mining techniques
S. Hyvönen¹, H. Junninen², L. Laakso², M. Dal Maso², T. Grönholm², B. Bonn²,
P. Keronen², P. Aalto², V. Hiltunen³, T. Pohja³, S. Launiainen², P. Hari⁴,
H. Mannila¹, and M. Kulmala²
¹Helsinki Institute of Information Technology, Basic Research Unit, Department of Computer
Science, University of Helsinki, P.O. Box 68, FIN–00014 University of Helsinki, Finland
²Department of Physics, University of Helsinki, P.O. Box 64, FIN–00014 University of Helsinki,
Finland
³Hyytiälä Forestry Field Station, Hyytiäläntie 124, 35500 Korkeakoski, Finland
⁴Department of Forest Ecology, Faculty of Agriculture and Forestry, P.O. Box 27, FIN–00014
University of Helsinki, Finland
Received: 13 June 2005 – Accepted: 20 June 2005 – Published: 29 August 2005
Correspondence to: S. Hyvönen (Saara.Hyvonen@cs.helsinki.fi)
© 2005 Author(s). This work is licensed under a Creative Commons License.
Abstract
Atmospheric aerosol particle formation is frequently observed throughout the atmosphere, but despite various attempts at explanation, the processes behind it remain unclear. In this study data mining techniques were used to find the key parameters needed for atmospheric aerosol particle formation to occur. A dataset of 8 years of 80 variables collected at the boreal forest station (SMEAR II) in Southern Finland was used, incorporating variables such as radiation, humidity, SO2, ozone and present aerosol surface area. Data analysis was done using clustering and classification methods. The aim of this approach was to gain new parameters independent of any subjective interpretation. This resulted in two key parameters, relative humidity and preexisting aerosol particle surface (condensation sink), capable of explaining 88% of the nucleation events. The inclusion of any further parameters did not improve the results notably. Using these two variables it was possible to derive a nucleation probability function. Interestingly, the two most important variables are related to mechanisms that prevent the nucleation from starting and particles from growing, while parameters related to initiation of particle formation seemed to be less important. Nucleation occurs only with low relative humidity and condensation sink values. One possible explanation for the effect of high water content is that it prevents biogenic hydrocarbon ozonolysis reactions from producing sufficient amounts of low volatility compounds, which might be able to nucleate. Unfortunately the most important biogenic hydrocarbon compound emissions were not available for this study. Another effect of water vapour may be due to its linkage to cloudiness, which may prevent the formation of nucleating and/or condensing vapours. A high number of preexisting particles will act as a sink for condensable vapours that otherwise would have been able to form sufficient supersaturation and initiate the nucleation process.
1. Introduction
Atmospheric aerosol particle formation is observed in various environments: the upper atmosphere (Eichkorn et al., 2002), marine environments (O’Dowd et al., 2002b), urban air (Mönkkönen et al., 2004; Dunn et al., 2004), remote areas (Koponen et al., 2002) and boreal forests (Mäkelä et al., 1997). A recent overview article discusses these observations in detail (Kulmala et al., 2004a).
Despite the numerous observations, the fundamental cause of atmospheric particle formation remains in many cases unknown. Because of the physical and chemical complexity of the atmosphere, it is often difficult to focus on the most relevant process causing nucleation. This focus is important, however, since without prior knowledge it is difficult to identify the key variables. If a wide range of measurements is carried out for a long period of time in one location, it may nevertheless be possible to detect subtle, previously unknown factors lying behind the atmospheric particle formation events. Currently, long-term atmospheric aerosol measurements are conducted only at a few stations (Ruuskanen et al., 2003; Sioutas et al., 2004; Aalto et al., 2001) or with a few measured parameters like CO2 (Keeling et al., 1982).
Even the few sets of long-term measurements have yielded many significant advances in atmospheric sciences. One such recent finding is that new atmospheric particle formation takes place in boreal forest environments around 50–100 times a year. These newly formed particles affect the Earth’s radiation budget directly by scattering and absorption (IPCC, 2001) and indirectly by acting as cloud condensation nuclei (Twomey, 1974).
Many studies have investigated the physical mechanisms, meteorological conditions (Nilsson et al., 2001) and chemical compounds related to particle formation (Weber et al., 1995; Korhonen et al., 1999; Birmili and Wiedensohler, 2000; O’Dowd et al., 2002a; Bonn and Moortgat, 2003; Kulmala et al., 2004a). Earlier attempts have demonstrated that favorable conditions for particle formation bursts include low atmospheric water content, low preexisting particle concentration and high solar radiation (Boy and Kulmala, 2002).
However, many previous studies have been based on preconceptions of which parameters are important, in which case the role of other parameters may have been overlooked. To avoid this, we have carried out a comprehensive study using data mining techniques. From the SMEAR II station we have collected a dataset of eight years with around 80 parameters, which were averaged over 30 min. This dataset was studied using different classification and clustering methods. In Sect. 2 we describe our measurements and the quality control of our database. Due to the great number of previous studies we do not describe everything exhaustively. Some derived variables, such as the condensation sink, are discussed in more detail. The data analysis methods used are described in Sect. 3. We present the main results obtained by the application of these methods in Sect. 4. Finally, in Sect. 5 we discuss our findings in the light of the physical and chemical processes involved in new particle formation, and draw some general conclusions.
2. Experimental
2.1. Sampling site
Measurements used in this study were performed during the years 1996–2003 at the SMEAR II station, which is located at the Hyytiälä Forestry Field Station of the University of Helsinki between Tampere and Jyväskylä in southern Finland (61°51′N, 24°17′E, 180 m a.s.l.). The station was designed to study mass and energy flows in the atmosphere-vegetation-soil continuum. Around the station, for about 200 m in all directions, there is a homogeneous 40-year-old Scots pine stand. The dominant stand height is about 14 m and the all-sided needle area is 7 m² m⁻². Rannik (1998) describes the micrometeorology of the site.
2.2. Measurements
In this study we used continuous measurements of the concentrations of NO, NOx, SO2, O3, H2O, CO2 and CO, of the number size distribution of aerosol particles (dry diameter of 3–600 nm) and of meteorological data, such as temperature, pressure, wind speed, wind direction, humidity and radiation (UV-A, UV-B, PAR, global, net, reflected global and reflected PAR). The measurements of gas concentrations and meteorological data were performed at different heights: levels of 4.2, 8.4, 16.8, 33.6, 50.4 and 67.2 m on the measurement tower. The number size distribution of aerosol particles was measured at 2 m height.
Flux measurements (sensible heat, latent heat, momentum, CO2, H2O, O3 and aerosol particles) were carried out in a tower at the height of 23.3 m and partly at the height of 46.0 m using the eddy covariance (EC) technique (Suni et al., 2003). Temporal gaps in the CO2 flux measurements were filled using the same method as Aubinet et al. (2001) and Falge et al. (2001). The details of the measurements performed continuously at the SMEAR II station can be found in Vesala et al. (1998).
2.2.1. Condensation sink
The ambient aerosol population acts as a sink for other atmospheric constituents by serving as a condensation surface for low-volatility vapours and by scavenging ultrafine aerosol particles by coagulation. To quantify these processes, we can calculate the condensation sink caused by the aerosol population (see, for example, Pirjola and Kulmala, 1998):
$$\mathrm{CS} = 2\pi D \int_0^\infty D_p\,\beta_m(D_p)\,n(D_p)\,\mathrm{d}D_p = 2\pi D \sum_i \beta_i D_{pi} N_i .$$
Here $D_{pi}$ describes the diameter of the particle in the size class $i$ and $N_i$ is the particle number concentration in the respective size class. $D$ is the diffusion coefficient of the condensing vapour, and $\beta_m$ the correction factor for the transition and the free molecular regimes (Fuchs and Sutugin, 1970). The condensation sink serves as an approximation of the coagulation sink, as it behaves identically, differing only in magnitude. Because the ambient aerosol particle size distribution in Hyytiälä was measured using a Differential Mobility Particle Sizer (DMPS) at low relative humidities and thus in a dry state, the hygroscopic growth factor was taken into account by using the parameterization by Laakso et al. (2004), so that the calculated sink corresponds to ambient RH conditions. Thus, the condensation sink depends on the particle size distribution, temperature (via the diffusion coefficient) and RH. The RH dependency of the condensation sink is stronger than the temperature effect.
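The discrete form of the condensation sink can be sketched numerically as follows. This is a minimal illustration, not the authors' code: it assumes the commonly quoted Fuchs-Sutugin form of the correction factor with an accommodation coefficient `alpha`, the numerical values of the diffusion coefficient and mean free path are placeholders, and the hygroscopic growth correction of Laakso et al. (2004) is left out.

```python
import numpy as np

def fuchs_sutugin(dp, mean_free_path, alpha=1.0):
    """Transition-regime correction factor for particle diameter dp (m)."""
    kn = 2.0 * mean_free_path / dp  # Knudsen number
    return (1.0 + kn) / (1.0 + (4.0 / (3.0 * alpha) + 0.377) * kn
                         + 4.0 / (3.0 * alpha) * kn ** 2)

def condensation_sink(dp, n, diff_coeff, mean_free_path, alpha=1.0):
    """Discrete sum CS = 2*pi*D * sum_i beta_i * Dp_i * N_i, in 1/s.

    dp: class diameters (m); n: number concentration per class (1/m^3).
    """
    dp = np.asarray(dp, dtype=float)
    n = np.asarray(n, dtype=float)
    beta = fuchs_sutugin(dp, mean_free_path, alpha)
    return 2.0 * np.pi * diff_coeff * np.sum(beta * dp * n)

# Assumed, purely illustrative inputs: three dry size classes, a vapour
# diffusion coefficient of 1e-5 m^2/s and a mean free path of 1e-7 m.
dp = np.array([10e-9, 100e-9, 500e-9])
n = np.array([1000e6, 500e6, 50e6])
cs = condensation_sink(dp, n, diff_coeff=1e-5, mean_free_path=1e-7)
```

The correction factor tends to one for large particles (continuum regime) and becomes small for the smallest particles, so the larger size classes contribute more to the sink per particle.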
2.2.2. Event classification
To distinguish between days with new particle formation and days with no particle formation we used a database created by Dal Maso et al. (2005)¹. The database was created by visual inspection of the continuously measured aerosol size distributions over a size range of 3–600 nm in Hyytiälä. Days displaying a growing new mode in the nucleation size range prevailing over several hours were classified as event days. Days which were clear of all traces of particle formation were classified as non-event days. Days which could not unambiguously be classified as either event or non-event days were termed “undefined” days, and removed from the data pool used in this study.
3. Computational methods
The data mining methods applied in this study are widely used ones. In this section we briefly describe each method used, but for details we refer the reader to e.g. Hand et al. (2001) and Hastie et al. (2001).
¹ Dal Maso, M., Kulmala, M., Riipinen, I., Wagner, R., Hussein, T., Aalto, P. P., and Lehtinen, K. E. J.: Formation and Growth of Fresh Atmospheric Aerosols: Eight Years of Aerosol Size Distribution Data from SMEAR II, Hyytiälä, Finland, submitted to Boreal Env. Res., 2005.
The computations were done in Matlab (Moler, 2004). In some cases the Statistics Toolbox was used.
3.1. Preprocessing of data
The raw datasets are very fragmented time series, 8 years of measurements every 30 min, with a large number of missing values. Using this large data set, we calculated for each day the mean and standard deviation of each variable in a chosen time window. The mean and standard deviation were only calculated if there were more than 5 measured values in the appropriate window. Otherwise the values on that day were declared missing. We chose to exclude each variable with more than 800 missing days (this includes particle flux and CO measurements) and after that any day with any missing variable. We also chose to exclude the latent heat flux measurements, as their correlation with the water vapour flux measurements is one.
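The rule for daily statistics can be illustrated with a short sketch. The function name `window_stats` is hypothetical, missing values are assumed to be coded as NaN, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def window_stats(values, min_count=5):
    """Mean and standard deviation of one variable within one day's window.

    Returns (nan, nan) unless more than `min_count` non-missing values exist,
    in which case the day is treated as missing for this variable.
    """
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]  # drop missing half-hour values
    if v.size <= min_count:
        return np.nan, np.nan
    return v.mean(), v.std(ddof=1)  # ddof=1 is an assumption
```

A window with exactly five valid values is therefore still declared missing, since the text requires more than five.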
The mast measurements made above the treetops were averaged to one variable (hi) and those made below the treetops to another (lo). As these correlate strongly, we have frequently included only the above-treetop averages.
Before calculations the data was normalized so that each variable has zero mean
and unit variance. The purpose of normalization is to make sure that all variables
are of equal weight. Otherwise, when comparing days, variables with large numerical
values will appear as more important.
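The normalization step corresponds to a column-wise z-score. A minimal sketch, assuming a days-by-variables matrix with no missing entries and no constant columns (a constant column would give a division by zero):

```python
import numpy as np

def standardize(x):
    """Scale each column of x (days x variables) to zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)
```

After this step a variable measured in hundreds and one measured in fractions contribute equally to Euclidean distances between days.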
After preprocessing and removal of undefined days we have around 500 days, roughly half of which are event days, and around 60 variables. The data set consists of the measurements shown in Table 1.
3.1.1. Selection of time window
It is not reasonable to calculate daily means and standard deviations of the variables for the whole 24 h, since in boreal regions such as Hyytiälä at 61° N the day length depends strongly on the time of year. Thus, for example, a fixed time window from 04:00 to 16:00 includes many non-daylight hours in the winter. A window of fixed length of 6 h starting at sunrise includes the whole day in midwinter and just the early morning hours (04:00–10:00) in midsummer. These, among several other time windows, were tested in the course of this work to obtain the most useful parameters for nucleation. All time windows cover the late morning hours, because this is the time nucleation usually occurs. Because of the variations in the length of the day, the window from sunrise to sunset seems a reasonable choice, and indeed it has the best classification performance (data not shown). We thus present the results for this window only. Selecting this window instead of one covering mainly the hours preceding the usual nucleation time means our results are likely to reflect the conditions under which aerosol particles keep growing rather than the factors initiating nucleation.
3.2. Clustering
In trying to understand what causes nucleation events a reasonable first approach is to cluster the days. In clustering one aims to divide the data into a number of clusters in such a way that data points (here days) in the same cluster are similar to each other, while data points in different clusters are dissimilar. A widely used clustering method is the K-means algorithm (MacQueen, 1967). In the basic version one starts by picking K cluster centers at random. One then repeatedly assigns each point to the cluster whose center is closest, and recomputes each cluster center as the mean of all points in that cluster. This is done until no changes in the centers occur. The most commonly used distance measure between points is the Euclidean distance.
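The basic K-means iteration described above can be sketched in a few lines. This is an illustrative implementation only; the initialization (K distinct data points drawn at random), the seed and the iteration cap are assumptions, and an empty cluster simply keeps its previous center.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Basic K-means with Euclidean distance; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Start from k distinct data points chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each center becomes the mean of its points.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # no change in the centers: converged
        centers = new_centers
    return labels, centers
```

Because the result depends on the random start, it is common to run the algorithm several times and keep the solution with the smallest within-cluster variation.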
When using K-means one first has to normalize the data and remove collinearities, otherwise variables with large numerical values or strong correlations will dominate the performance of the clustering algorithm. Elimination of correlations can be done using principal components analysis as a preprocessing step. Principal components analysis (PCA) uses singular value decomposition (SVD) on the centered data matrix to find mutually orthogonal linear combinations of the original variables in such a way that the variance of the original data is preserved as well as possible (Pearson, 1901). In many cases the variance captured by the last principal components is very small, and they can be left out. One can project the data onto the first few principal components, renormalize and do the clustering for this new data matrix. For this data the clustering done using the first six principal components, which capture 70% of the variance in the data, resembles the clustering done on the original data matrix after a few strongly correlating variables are removed, so we present the results for the original data only. From the data used in clustering we have left out all radiation measurements except global radiation, as all of these correlate strongly.
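PCA via SVD of the centered data matrix can be sketched as below. The function name and return values are illustrative choices; the fraction of variance captured by each component follows from the squared singular values.

```python
import numpy as np

def pca(x, n_components):
    """Project x (days x variables) onto its first principal components.

    Returns (scores, components, explained), where `explained` holds the
    fraction of total variance captured by each retained component.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0)  # center each variable
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = xc @ vt[:n_components].T
    return scores, vt[:n_components], explained[:n_components]
```

Summing `explained` over the retained components gives the kind of figure quoted in the text (six components capturing 70% of the variance).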
There are several methods for choosing the number of clusters K. We have used the Davies-Bouldin index (Davies and Bouldin, 1979). It is a function of the ratio of the sum of within-cluster variation to between-cluster separation, and therefore favors compact and well separated clusters.
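A common formulation of the Davies-Bouldin index is sketched below. It assumes the within-cluster scatter is the mean Euclidean distance of the points to their cluster center and the separation is the distance between centers; the exact variant used in the study is not stated. Lower values indicate more compact, better separated clusters.

```python
import numpy as np

def davies_bouldin(x, labels):
    """Davies-Bouldin index of a clustering (lower is better)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centers = np.array([x[labels == c].mean(axis=0) for c in ids])
    # Within-cluster scatter: mean distance of points to their own center.
    scatter = np.array([
        np.linalg.norm(x[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    worst = np.zeros(k)
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(k) if j != i
        ]
        worst[i] = max(ratios)  # most similar other cluster
    return worst.mean()
```

To choose K, one would cluster the data for several candidate values and prefer the value with the smallest index.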
3.3. Classification methods
An alternative approach to understand the occurrence of events is to consider the
setting as a classification problem: we want to use the data to classify each day as
an event day or a nonevent day. In fact, we are not really interested in separating
event days from nonevent days, but in understanding which variables one should use
to separate the two groups.
A standard approach in estimating the performance of classification methods is to use cross-validation. The data is repeatedly split into two independent sets, one of which is used as the training set to fit the model in question, and the other is used as the test set to obtain an unbiased estimate for the classification error.
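The repeated splitting can be sketched as follows. The split fraction, the number of runs and the toy nearest-mean classifier are all assumptions made for illustration; any classifier with the same `fit_predict(x_train, y_train, x_test)` signature could be plugged in.

```python
import numpy as np

def repeated_holdout(x, y, fit_predict, n_runs=100, test_frac=0.3, seed=0):
    """Mean and spread of the test error (%) over random train/test splits."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    n_test = max(1, int(round(test_frac * len(y))))
    errors = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        y_hat = fit_predict(x[train], y[train], x[test])
        errors.append(100.0 * np.mean(y_hat != y[test]))
    return float(np.mean(errors)), float(np.std(errors))

def nearest_mean(x_train, y_train, x_test):
    """Toy classifier: assign each test point to the closer class mean."""
    m0 = x_train[y_train == 0].mean(axis=0)
    m1 = x_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(x_test - m0, axis=1)
    d1 = np.linalg.norm(x_test - m1, axis=1)
    return (d1 < d0).astype(int)
```

The spread of the per-run errors is what a confidence interval for the average error would be built from.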
We evaluate the performance of the methods by computing the misclassification rate:
$$\mathrm{error} = \frac{N_\mathrm{missed} + N_\mathrm{false}}{N_\mathrm{total}} \cdot 100\%,$$
where $N_\mathrm{missed}$ is the number of event days classified as nonevents, $N_\mathrm{false}$ is the number of nonevent days classified as event days, and $N_\mathrm{total}$ is the total number of days classified. After cross-validation we report the average misclassification rate together with 95% confidence intervals. We frequently also list the proportion of missed events and false events:
$$\mathrm{missed} = \frac{N_\mathrm{missed}}{N_\mathrm{total}} \cdot 100\%, \qquad \mathrm{false} = \frac{N_\mathrm{false}}{N_\mathrm{total}} \cdot 100\%.$$
Most classification methods require all classes to have approximately the same number of cases.
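The three rates defined above translate directly into code. The sketch assumes event days are coded as 1 (or True) and nonevent days as 0 (or False); the function name is illustrative.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Misclassification, missed-event and false-event rates in percent."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    n_total = y_true.size
    n_missed = np.sum(y_true & ~y_pred)  # event days classified as nonevents
    n_false = np.sum(~y_true & y_pred)   # nonevent days classified as events
    return {
        "error": 100.0 * (n_missed + n_false) / n_total,
        "missed": 100.0 * n_missed / n_total,
        "false": 100.0 * n_false / n_total,
    }
```

By construction the missed and false rates add up to the total misclassification rate.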
3.3.1. Linear methods for classification
For an important class of classification methods the boundaries separating the objects to be classified are linear. There are a number of methods to find a linear separating hyperplane. We briefly describe some of them. For more details see e.g. the references mentioned earlier (Hand et al., 2001; Hastie et al., 2001).
In Linear Discriminant Analysis (LDA) the goal is to find a set of linear combinations
of the original variables so that when the data is projected onto the subspace spanned
by these vectors the within-class scatter is minimized and the between-class scatter is
maximized. Such linear combinations are called linear discriminants. In a two-class
case such as ours we only look for one linear discriminant. The first linear discriminant
is the normal of the hyperplane separating the two classes. It therefore also tells how
event days are separated from nonevent days. LDA is closely related to multivariate
analysis of variance (MANOVA).
One can use LDA for fitting quadratic boundaries by adding the second order terms to the data matrix. For example, in the two variable case we add to the variables x and y the second order terms x², y² and xy. We then do LDA in this five-dimensional space with coordinates (x, y, x², y², xy) instead of the original two-dimensional one with coordinates (x, y). We shall refer to this method as LDAQ.
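A two-class linear discriminant and the quadratic expansion used for LDAQ can be sketched as below. This is the textbook Fisher discriminant, not necessarily the exact routine used in the study; placing the decision threshold at the midpoint of the projected class means is an assumption (it corresponds to equal class priors).

```python
import numpy as np

def fisher_lda(x, y):
    """Two-class linear discriminant: direction w and threshold c.

    A point is assigned to class 1 when x @ w > c.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    x0, x1 = x[y == 0], x[y == 1]
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Pooled within-class scatter matrix.
    sw = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    w = np.linalg.solve(sw, m1 - m0)
    c = 0.5 * (m0 + m1) @ w  # midpoint of the projected class means
    return w, c

def quadratic_terms(xy):
    """Map columns (x, y) to (x, y, x^2, y^2, xy) for LDAQ."""
    xy = np.asarray(xy, dtype=float)
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([x, y, x ** 2, y ** 2, x * y])
```

Running `fisher_lda` on the output of `quadratic_terms` gives a boundary that is linear in the five-dimensional space and quadratic in the original two variables.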
Linear regression in turn predicts the output y via a linear model
$$y = \beta_0 + \sum_{j=1}^{n} \beta_j x_j,$$
where $x = (x_j)_{j=1}^{n}$ is our n-dimensional input data. This is usually used to predict quantitative outputs, but it can be used for classification tasks too. In the classification case we define y to be one for event days and zero for nonevent days, and fit the regression model accordingly. Our input data consists of the measurement vectors for each day.
Logistic regression belongs to generalized linear models. Here we want to formulate a model for the probability that the output y is 1 given the input x: p(y=1|x). We could use a linear model for this, but this is not ideal. For example, a linear model can take values outside the interval [0,1], which are not meaningful. Instead, we modify the model by transforming the probability nonlinearly so that it can be modeled by a linear combination. In logistic regression this nonlinear transformation is the logit (log-odds) function, the inverse of the logistic function:
$$\log \frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)},$$
which is modeled linearly, i.e.
$$\log\frac{p}{1 - p} = \beta_0 + \sum_{j=1}^{n} \beta_j x_j.$$
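The relation between the log-odds and the probability, and one simple way of fitting the coefficients, can be sketched as follows. Plain gradient ascent on the mean log-likelihood is used here purely for illustration; the step size and iteration count are assumptions, and the fitting routine used in the study is not stated.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.1, n_iter=5000):
    """Fit (beta_0, ..., beta_n) by gradient ascent on the mean log-likelihood."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xb = np.column_stack([np.ones(len(x)), x])  # prepend intercept column
    beta = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(xb @ beta)
        beta += lr * xb.T @ (y - p) / len(y)
    return beta

def predict_proba(beta, x):
    """Modelled probability p(y = 1 | x) for each row of x."""
    x = np.asarray(x, dtype=float)
    return sigmoid(beta[0] + x @ beta[1:])
```

Unlike the linear regression output, the predicted values always lie strictly between 0 and 1, so they can be read as event probabilities.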
Support vector machines (SVM) belong to kernel methods, in which the idea is to map the original data (usually nonlinearly) into a (higher dimensional) feature space and do e.g. classification there (Shawe-Taylor and Cristianini, 2004). When using a linear kernel this method falls into the category of linear methods. In this case we lose some of the potential of the method, but we are able to keep track of the variables. Sacrificing linearity (which in any case is probably too strict an assumption in our case) we have a choice of a wide variety of kernels. The most commonly used ones include polynomial kernels and RBF (radial basis function) kernels. Polynomial kernels of degree two were tried in our study, but since the results for a wide variety of parameter choices were consistently worse than for linear kernels, these results are omitted. We used the LS-SVM Toolbox for Matlab (Pelckmans et al., 2003).
3.3.2. Other classification methods
With the SVMs we already moved out of the realm of linear methods. Here we describe
two other nonlinear classification methods that have been used.
K-nearest neighbor classification takes a point in the test set, compares it with all the points in the training set, and decides the class by looking at the class of the K nearest neighbors of the point. In our case, for K=10, the event status of a day in the test set is decided by looking at the event status of the 10 days most closely resembling the day under inspection. This gives us a feel for how close the event days are to each other. However, we do not gain information about the respects in which the event days are similar to each other.
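A majority-vote K-nearest-neighbor classifier can be sketched as below. Euclidean distance on normalized data is assumed, and an exact tie in the vote is resolved as a nonevent day, which is an arbitrary choice made for this illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_test, k=10):
    """Classify each test day by majority vote among its k nearest training days."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_test = np.asarray(x_test, dtype=float)
    dist = np.linalg.norm(x_test[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)  # fraction of event-day neighbours
    return (votes > 0.5).astype(int)       # ties count as nonevent
```

The method stores the whole training set and has no fitted coefficients, which is why it says little about which variables make event days alike.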
Classification trees (Breiman et al., 1984) partition the feature space into a set of rectangles, and then assign a constant class to each one. We first split the space into two regions, and assign a class to each one. The variable and the split-point are selected to minimize the classification error. Then both regions are split into two more regions, and this process is continued until some stopping criterion is met. This can be visualized as a tree, see Fig. 1. The topmost variable (RH) is the single variable with the best classification performance. On the left branch of the tree we have the condensation sink. This is the best variable (in terms of classification performance) in distinguishing event days from nonevent ones in the half plane RH<77. Compare this to Fig. 4. The leftmost branch of the tree presented in Fig. 1 corresponds to the lower left corner of Fig. 4.
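The first step of growing such a tree, choosing the single variable and split-point with the lowest classification error, can be sketched as follows. Candidate thresholds at midpoints between consecutive distinct values are an assumption; a full tree would apply the same search recursively to each resulting region.

```python
import numpy as np

def best_split(x, y):
    """Best single split as (error_count, feature, threshold, class_below)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    best = None
    for j in range(x.shape[1]):
        values = np.unique(x[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            below = x[:, j] < t
            for left_class in (0, 1):
                pred = np.where(below, left_class, 1 - left_class)
                errors = int(np.sum(pred != y))
                if best is None or errors < best[0]:
                    best = (errors, j, float(t), left_class)
    return best
```

In the made-up example used for testing this sketch, the first column plays the role of RH: low values are event days, and the split lands between the two groups.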
3.3.3. Feature selection
A simple approach to gain insight into the importance of different variables in explaining events is to take all pairs of variables and see how well the days are classified as event or nonevent days on the basis of the values of each pair. The same can be done for each triplet of variables, but beyond that the complexity of the problem makes this approach impractical.
Of course, one is hardly likely to find a satisfactory explanation for such a complex phenomenon by just using two or three variables. An alternative is to use a stepwise approach (Hand et al., 2001). In stepwise forward selection of variables we start with the variable which gives the best classification result by itself, and on each step add the variable which results in the best classification. In stepwise backward selection of variables, we start with all variables, and on each step leave out one variable, chosen so that the classification result is optimized. In forward selection there is the risk that the combined effect of some set of variables is missed. In backward selection it is possible that we discard a significant variable at an early stage. For our data backward selection performed poorly, so the results are omitted.
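Stepwise forward selection can be sketched as a greedy loop around any error estimate. Here `score` is a hypothetical callback returning the error for a given subset of columns; the toy `nearest_mean_error` uses the training error of a nearest-mean classifier only to keep the example short, whereas a cross-validated error would be the natural choice in practice.

```python
import numpy as np

def forward_selection(x, y, score, n_select):
    """Greedy forward selection; `score(x_subset, y)` returns an error to minimize."""
    x = np.asarray(x, dtype=float)
    selected, remaining = [], list(range(x.shape[1]))
    for _ in range(n_select):
        # Try adding each remaining variable to the current set.
        errs = [score(x[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmin(errs))]
        selected.append(best)
        remaining.remove(best)
    return selected

def nearest_mean_error(x, y):
    """Toy score: training error of a nearest-class-mean classifier."""
    y = np.asarray(y)
    m0, m1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    pred = np.linalg.norm(x - m1, axis=1) < np.linalg.norm(x - m0, axis=1)
    return float(np.mean(pred.astype(int) != y))
```

Because each step conditions on the variables already chosen, a pair that is only informative jointly can be missed, which is the risk noted in the text.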
A tempting approach is to look at the weights given by linear regression for each
variable, or the normal of the separating hyperplane in the case of linear discriminant
analysis. One could argue that these tell about the relative importance of the variables.
This, however, is not true when there are strongly correlating variables so one should
only use this approach with extreme caution: for our data set it was not applicable.
4. Results
4.1. Clustering
We used K-means clustering to cluster the days into four clusters. The results are very good: the algorithm does not use event information for clustering, yet it produces clusters with very few event days, as well as one with over 90% event days.
The temporal distribution of these days is presented in Fig. 2. Note the temporal
cohesion of the clusters, even though the calendar time is not used in the clustering.
Cluster 1 consists almost solely of event days, whereas clusters 3 and 4 have almost
no events. From top to bottom, the counts for days and event days for each cluster are
presented in Table 2.
We observe four robust clusters: spring & fall days (cluster 1), summer days (cluster 2), cloudy days (cluster 3) and polluted days (cluster 4). The names describing clusters 3 and 4 are derived by looking at the cluster centers of these clusters. The cluster centers, describing the typical values of each variable in each cluster, are presented in Fig. 3. One can see that the best parameters to separate the event clusters (1 and 2) from the non-event clusters (3 and 4) are relative humidity, global radiation and sensible heat. The means of the ozone and carbon dioxide concentrations also have some separating power. Most of the event days fall into clusters 1 and 2. The main difference between these clusters is the time of the year and the related physical parameters. The summer days in cluster 2 have higher temperatures along with an elevated concentration of water and higher daily variability of CO2, O3 and H2O concentrations. The CO2 and H2O fluxes also differ between clusters 1 and 2. The condensation sink has low values in the cluster with most of the events.
4.2. Results using classification methods
The main result given by the wide range of classification methods used is that the most
important variables in explaining the nucleation events are the means of the relative
humidity (RH) and the logarithm of the condensation sink. This is supported by a
number of different approaches.
– When fitting a decision tree to the data, these are the top two variables selected in most cases. Moreover, on the test set the tree involving only these variables (see Fig. 1) frequently performs as well as more complicated trees, which tend to overfit.
– These are also the first two variables selected when doing forward stepwise selection of variables using any of the linear methods.
– These two variables form the best pair of variables. They are also almost always included among the best three variables. The best pairs were sought using both linear regression and linear discriminant analysis. For the best triplets, only linear regression was used.
The performance of a number of methods using only RH and the logarithm of the condensation sink is summarized in Table 3. Each method was run 1000 times using different training and test sets, and the average percentage of errors and 95% confidence intervals for the errors were computed.
In Table 4 we have summarized the performance of a few of the top ranking pairs
using LDA. We see that RH and the condensation sink have the best performance. The
other methods yield similar results.
This can be compared to the performance of a few of the top ranking triplets using
linear regression, summarized in Table 5. It is evident that there is no “best triplet” as
the 95% confidence intervals of all of these overlap. In fact, for 127 triplets the 95%
confidence intervals overlap with that of the best ranked one, topmost in this table, and
80 of these have confidence intervals which overlap that of the best pair; not one triplet
is clearly better than the best pair.
We have demonstrated above that relative humidity and the condensation sink are the most significant variables explaining the nucleation events. All of the linear classification methods had an error rate of approximately 12% when using only these two variables. It seems reasonable to expect that adding variables to the model would improve the classification results. But here we run into the problem demonstrated by the best triplets: there are too many choices of variables with equal performance.
When using stepwise addition of variables together with any of the classification
methods, different runs (using different training sets) yield different sets of variables