Predicting the Olea pollen concentration with a machine learning
algorithm ensemble
José María Cordero & J. Rojo & A. Montserrat Gutiérrez-Bustillo & Adolfo Narros & Rafael Borge
Received: 11 May 2020 / Revised: 14 September 2020 / Accepted: 31 October 2020
© ISB 2020
Air pollution in large cities produces numerous diseases and even millions of deaths annually according to the World Health
Organization. Pollen exposure is related to allergic diseases, which makes its prediction a valuable tool to assess the risk level to
aeroallergens. However, airborne pollen concentrations are difficult to predict due to the inherent complexity of the relationships
among both biotic and environmental variables. In this work, a stochastic approach based on supervised machine learning
algorithms was performed to forecast the daily Olea pollen concentrations in the Community of Madrid, central Spain, from
1993 to 2018. Firstly, individual Light Gradient Boosting Machine (LightGBM) and artificial neural network (ANN) models
were applied to predict the day of the year (DOY) when the peak of the pollen season occurs, resulting in estimated average peak
dates of 149.1 ± 9.3 and 150.1 ± 10.8 DOY for LightGBM and ANN, respectively, close to the observed value (148.8 ± 9.8).
Secondly, the daily pollen concentrations during the entire pollen season have been calculated using an ensemble of two-step
GAM followed by LightGBM and ANN. The results of the prediction of daily pollen concentrations showed a coefficient of
determination (r²) above 0.75 (goodness of the model following cross-validation). The predictors included in the ensemble
models were meteorological variables, phenological metrics, specific site-characteristics, and preceding pollen concentrations.
The models are state-of-the-art in machine learning and have shown their potential to be deployed to understand and
to predict the pollen risk levels during the main olive pollen season.
Keywords Air quality · Pollen exposure · Pollen prediction · Neural networks · Boosted trees
* José María Cordero
Universidad Politécnica de Madrid (UPM), ETSII-UPM, José Gutiérrez Abascal 2, 28006 Madrid, Spain
University of Castilla-La Mancha, Institute of Environmental Sciences (Botany), Avda. Carlos III s/n, E-45071 Toledo, Spain
Department of Pharmacology, Pharmacognosy and Botany, Complutense University of Madrid, Ciudad Universitaria, 28040 Madrid, Spain
International Journal of Biometeorology
Poor air quality is associated with mortality and morbidity through respiratory and other causes, including cardiovascular diseases and lung cancer (Burnett et al. 2018; Cole-Hunter et al. 2018). According to the World Health Organization (WHO), 4.2 million premature deaths worldwide every year can be attributed to air pollution (World Health Organization, 2019). Moreover, the contribution of air pollution to premature mortality could double by 2050 (Lelieveld et al. 2015). Over 500,000 premature annual deaths are attributed to population exposure to PM in Europe (EEA 2019a). In this sense, there is a clear need to improve urban air quality and foster climatic co-benefits (Balbus et al. 2014; Xie et al. 2018) in agreement with the air quality standards of the Air Quality Directive (AQD) (Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe, 2008) (EEA 2019b).
In addition to abiotic air pollutants (particulate inorganic matter, NO, etc.), interest within the air quality field is increasing in biological particles, which pose an important risk to human health (Lake et al. 2017). According to the World Allergy Organization, allergies are considered one of the most relevant public health problems in this century (D'Amato et al. 2015; Pawankar et al. 2013). Accordingly, numerous medical studies have quantified hospital admissions as a consequence of allergic reactions to high pollen levels in the preceding days (Diaz et al. 2007; Galan et al. 2010; Thien et al. 2018). With this background, interest in the monitoring of airborne biological particles has increased during recent decades, not only from an allergological point of view but also for agricultural and ecological purposes (Beggs et al. 2017; Šantl-Temkiv et al. 2019).
The prediction of Olea pollen concentrations during the pollen season is a valuable tool to assess the risk level and to warn pollen allergy sufferers.
Short-term predictive modelling of olive pollen is very complex because (i) the yearly flower production of olive trees depends on the environmental conditions during the current and previous year, as well as on the internal hormonal and biochemical balance of the plant (Rojo et al. 2015); and (ii) local features such as topography or altitude introduce variability in the succession of olive flowering characteristics (Aguilera and Ruiz Valenzuela 2012; Oteros et al. 2013; Rojo and Perez-Badia 2014). In addition, strong variability in the maximum pollen concentration has been reported among years and also between consecutive days (Fernandez-Rodriguez et al. 2016a; Perez-Badia et al. 2013). Furthermore, the temperature or precipitation of the previous days has autoregressive effects on pollen concentration (Silva-Palacios et al. 2016).
Modelling approaches for the main pollen season (MPS) (Galán et al. 2017) of Olea pollen and other pollen types have often addressed this issue by estimating phenological metrics (start and peak dates and length of the pollen season) and pollen intensity parameters (annual amounts and daily concentrations) separately (Fernandez-Rodriguez et al. 2016b). A number of works analysed the importance of accumulated temperature and precipitation prior to the flowering stage (Rojo and Perez-Badia 2015), as well as thermal requirements based on the chilling and forcing accumulated periods of the olive tree (Aguilera et al. 2014; Chuine et al. 1999; Fraga et al. 2019; Orlandi et al. 2013). In addition, predictive models for forecasting daily pollen concentrations were developed taking into account the meteorological variables and pollen concentration of the preceding days, using statistical procedures ranging from linear regression models (Cotos-Yanez et al. 2004) to neural network models (Iglesias-Otero et al. 2015). Also, historical time series of pollen data have been studied to predict future pollen levels based on the inherent seasonal behaviour of the pollen season in conjunction with current meteorological conditions (Rojo et al. 2017).
In this work, we use state-of-the-art machine learning models that have not previously been applied to pollen prediction. LightGBM (light gradient boosting machine) and ANNs (artificial neural networks) are able to capture non-linearities and to learn complex relationships among the predicting variables. Firstly, LightGBM and ANNs are applied to predict the peak of the MPS. Secondly, a supervised machine learning ensembling process that involves two steps of generalized additive models (GAMs), followed by LightGBM on one side and ANNs on the other, is proposed to predict the Olea pollen concentration. The phenology of the MPS was accounted for using the day of the year (DOY) as a numeric input to the models (start and peak dates). The accumulated heat requirements necessary for olive flowering were addressed by introducing a forcing requirements variable (state of forcing, Sf). The effect of the time series was accounted for by introducing lags of 3 and 5 days of Olea pollen concentrations. The measurement locations and their meteorology were also used as features. The main objective of this work was to develop predictive models for the daily concentrations and the periods of the pollen season of Olea pollen. The results of these forecasting models would allow the health authorities to anticipate adequate measures to minimize the impact on the population.
Materials and methods
Pollen data
The PALINOCAM Network, belonging to the Community of Madrid, has been measuring pollen concentrations since 1993, recording an extensive database of daily measurements involving 25 morphological pollen types. Of all the pollen types, Poaceae (=Gramineae) (Grass), Olea (Olive), Cupressaceae/Taxaceae (Cypress/Arizonica), and Platanus (London plane) are of special interest in the Mediterranean region due to their impact on human health and their relative abundance in the air (D'Amato et al. 2007; Perez-Badia et al. 2010). Olive pollen is one of the most important aeroallergens monitored by the PALINOCAM Network (Community of Madrid, central Spain).
The study includes the analysis of 22 years of data. The period 1997–2017 was employed to train the model, and the remaining year, 2018, was used for external validation. It includes up to 10 pollen sampling points where Olea pollen was measured (Fig. 1). These locations belong to the aerobiological network of Madrid (PALINOCAM, 2019). Pollen sampling was done using 10 Hirst-type volumetric spore traps according to the guidelines of the international standardized methodological procedure agreed by the main scientific aerobiological organizations in Europe (Galan et al. 2014). The units of the daily pollen concentrations were number of pollen grains per cubic meter of air (pollen grains/m³).
The Olea_3_day and Olea_5_day variables were calculated as the daily pollen concentration 3 and 5 days before, respectively. Taking them as lag predictors allows an autoregressive effect to be included in the time series. Furthermore, to account for the seasonality component of the time series, the day of the year (DOY) was also included as a feature (yearday), considering 1 January as day one. This variable indicates the moment of the year in the model, which is directly related to the strong seasonality of the pollen season. In this way, all components of classical time series analysis (seasonality and trend) are accounted for in the fitted models. Note that the objective of this work was to predict, some days in advance, whether a pollen peak is going to occur. The Olea pollen concentration is available daily, so the concentrations of preceding days can be accessed online and used in operational predictive models.
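As an illustration of the lag and seasonality features described above, the following minimal Python sketch builds Olea_3_day, Olea_5_day, and yearday from a daily series. The function name, the record layout, and the sample concentrations are assumptions made for illustration and are not taken from the authors' code.

```python
from datetime import date, timedelta

def add_lag_features(records, lags=(3, 5)):
    """Add lagged concentrations and day of year to daily records.

    `records` is a list of (date, concentration) tuples for one station.
    Lags are looked up by calendar date, so missing days yield None.
    """
    by_day = {day: conc for day, conc in records}
    rows = []
    for day, conc in records:
        row = {
            "date": day,
            "olea": conc,
            # Day of year, with 1 January counted as day 1.
            "yearday": day.timetuple().tm_yday,
        }
        for lag in lags:
            row[f"Olea_{lag}_day"] = by_day.get(day - timedelta(days=lag))
        rows.append(row)
    return rows

# Hypothetical concentrations (pollen grains/m3) for six consecutive days.
start = date(2018, 5, 10)
series = [(start + timedelta(days=i), c)
          for i, c in enumerate([12, 30, 55, 140, 210, 95])]
features = add_lag_features(series)
```

Looking lags up by date rather than by list position is a design choice of this sketch: it avoids silently pairing a day with the wrong predecessor when the record has gaps.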
Meteorological data
The meteorological data were obtained from the Open Data API of AEMET (Open Data, 2019). The meteorological variables considered were maximum temperature (tmax), minimum temperature (tmin), average temperature (tmed), wind speed (Ws), wind direction (Wd), and sunshine hours (Sun).
Phenological variables and thermal requirements
Several references define precisely the timing dates related to the MPS (Bastl et al. 2018; Galán et al. 2017; Pfaar et al. 2018). The present study has used an adaptation of Pfaar et al. (2018): the start_PS was taken as the first DOY when the Olea pollen concentration was above 20 grains/m³, whereas the other phenological variables (high_PS and peak_PS) were taken exactly as indicated in that reference. The start_PS was introduced in the models as a binary variable yielding 1 once the MPS starts. The high_PS variable was also introduced as binary, yielding 1 when the daily pollen concentration was above 100 grains/m³. The same was done for peak_PS, corresponding to the DOY when the maximum Olea pollen concentration is reached.
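The binary phenological variables can be sketched in a few lines of Python. The thresholds of 20 and 100 grains/m³ come from the text; the function name, the single-season input, and the reading of peak_PS as a flag set only on the day of the seasonal maximum are assumptions of this sketch.

```python
def phenological_flags(concentrations, start_threshold=20, high_threshold=100):
    """Return per-day start_PS, high_PS and peak_PS flags for one season.

    `concentrations` is a non-empty list of daily values (grains/m3),
    ordered by day, for a single station and year.
    """
    # First day above the start threshold; None if the season never starts.
    start_index = next(
        (i for i, c in enumerate(concentrations) if c > start_threshold), None
    )
    # Day of the seasonal maximum (first occurrence if there are ties).
    peak_index = max(range(len(concentrations)),
                     key=concentrations.__getitem__)
    flags = []
    for i, c in enumerate(concentrations):
        flags.append({
            # 1 from the start of the season onwards, 0 before it.
            "start_PS": int(start_index is not None and i >= start_index),
            "high_PS": int(c > high_threshold),
            "peak_PS": int(i == peak_index),
        })
    return flags

# Hypothetical daily concentrations for one short season.
daily = [5, 18, 25, 60, 150, 90, 15]
season_flags = phenological_flags(daily)
```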
Fig. 1 Location of the monitoring sites in the Community of Madrid (extent highlighted in red). Green triangles for the PALINOCAM pollen stations, blue squares for meteorological stations
Modelling of the phenological variables was conducted using a thermal-based approach based on the state of forcing heat (Sf). The Sf necessary to initiate bud growth has been widely applied to determine the start of the pollen season (Chuine et al. 1998; Osborne et al. 2000; Picornell et al. 2019). Sf is expressed in forcing units, which are non-dimensional, and can be understood as a mathematical transformation of heat (Chuine et al. 1999). The following equations have been commonly used for calculating the state of forcing heat (Sf):

$$Sf = \begin{cases} 0, & T_{avg} < T_b \\ \sum_{i=1}^{n} \left( T_{avg} - T_b \right), & T_{avg} \ge T_b \end{cases} \tag{1}$$

$$Sf = \sum_{n} \left( \frac{1}{1 + e^{d \left( T_{avg} - c \right)}} \right) \tag{2}$$

where n is the number of days of the year, d is a numeric parameter with negative values, Tavg is the daily averaged temperature, Tb is the base temperature, and c is the temperature threshold. According to the equations above, a day when Tavg is under the temperature threshold c would have a nearly null contribution to the sum of Sf.
Conversely, if Tavg is above the temperature threshold, that day would have a high contribution to the Sf. The base temperature can be arbitrarily adjusted (Chuine et al. 1998), i.e. taken as a fixed parameter; only the c and d constants would be different for different Tb. Another method is to consider it as the temperature above which the heat forcing starts affecting pollen production (Osborne et al. 2000).
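The two forcing formulations can be written compactly in Python. This is a minimal sketch that assumes Eq. (1) is a thermal sum above the base temperature Tb and Eq. (2) is a sigmoid response with threshold c and a negative parameter d, as the surrounding definitions suggest. The function names and the sample temperatures are illustrative only.

```python
import math

def sf_thermal_sum(tavg, tb):
    """Eq. (1): accumulate (Tavg - Tb) over days at or above Tb."""
    return sum(t - tb for t in tavg if t >= tb)

def sf_sigmoid(tavg, c, d):
    """Eq. (2): accumulate a sigmoid response to temperature.

    With d < 0, days well below c contribute almost 0 and days well
    above c contribute almost 1 forcing unit.
    """
    return sum(1.0 / (1.0 + math.exp(d * (t - c))) for t in tavg)

# Hypothetical daily mean temperatures (degrees C).
temps = [10.0, 18.4, 20.4]
thermal = sf_thermal_sum(temps, tb=18.4)
sigmoid = sf_sigmoid(temps, c=18.89, d=-6.12)
```

The sigmoid form gives a bounded daily contribution, whereas the thermal sum grows with every degree above Tb, which is why the two are not interchangeable without refitting c and d.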
To fit the parameters Tb, d, and c, the following methodology was applied. Firstly, Tb was computed as the average temperature, over all the locations and years of the time series excluding the validation data (year 2018), during the week prior to the start of the pollen season (start_PS). This Tb, which turned out to be 18.4 °C, was a good proxy for the range of temperatures in the week prior to flowering and the development of the anthers. Then, the initial Sf was calculated according to Eq. (1).
In the next step, a non-linear least squares algorithm based on Gauss-Newton (Hartley 1961) was fitted to the data, resulting in values for the parameters c and d of 18.89 and −6.12 °C, respectively. The mean absolute error (MAE) between the Sf calculated with the fitted non-linear model (Eq. 2) and the predictions of Eq. 1 was computed. Once the Sf was finally computed, the mean Sf that marked the start_PS could be calculated for all stations and years. We will refer to this Sf as the critic Sf, since by itself it can be used to estimate the day of the year (DOY) when the pollen season starts (start_PS), and hence to predict the start of the pollen season at a given site from its past conditions, taking into account only the daily averaged temperatures. In our case, as an average among locations and years, it yielded 3.3 forcing units. The capability of Sf to mark the start of the pollen season provides valuable information for the predictive models.
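A possible way to use the critic Sf operationally is to accumulate the daily forcing from 1 January and report the first day on which the running total reaches the critical value. The sketch below assumes the sigmoid form of Sf with c = 18.89 and the critical value of 3.3 forcing units reported above; the sign of d is taken as negative, as the definition of d requires. The function name and the synthetic temperatures are hypothetical.

```python
import math

def predict_start_doy(tavg_by_doy, critic_sf=3.3, c=18.89, d=-6.12):
    """Return the first DOY at which accumulated Sf reaches critic_sf.

    `tavg_by_doy` lists daily mean temperatures starting on 1 January.
    Returns None if the critical value is never reached.
    """
    accumulated = 0.0
    for doy, t in enumerate(tavg_by_doy, start=1):
        # Daily sigmoid contribution (assumed form of the forcing model).
        accumulated += 1.0 / (1.0 + math.exp(d * (t - c)))
        if accumulated >= critic_sf:
            return doy
    return None

# Synthetic year: 100 cold days at 10 C followed by warm days at 25 C.
temperatures = [10.0] * 100 + [25.0] * 30
start_doy = predict_start_doy(temperatures)
```

In this synthetic series each warm day adds almost one forcing unit, so the threshold of 3.3 is crossed on the fourth warm day.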
Site-dependent variable (locations)
We introduced a location feature as a categorical variable that takes 10 possible values (one for each pollen station). It was then transformed into 10 binary dummy variables. This last step is necessary for linear models (ANNs) but could be skipped for tree-based models due to their intrinsic non-linearity. Although the models have been applied to the Community of Madrid, they can be easily extrapolated to other locations and other palynological species. The user only has to code the locations as dummy variables and fit the proposed algorithm.
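Coding the locations as dummy variables amounts to one-hot encoding. The following minimal sketch shows the idea with a few station codes; the helper name is an assumption, and only three codes are used to keep the example short.

```python
def one_hot(values, categories=None):
    """Encode categorical values as lists of 0/1 dummy variables.

    If `categories` is not given, the sorted unique values are used,
    which fixes the column order.
    """
    if categories is None:
        categories = sorted(set(values))
    else:
        categories = list(categories)
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per observation
        rows.append(row)
    return categories, rows

# Station codes for four observations (illustrative subset).
stations = ["alc", "get", "roz", "alc"]
columns, dummies = one_hot(stations)
```

Passing an explicit `categories` list is useful when new data must be encoded with the same column order that was used for training.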
In this work, two separate lines of modelling algorithms were developed: (1) in the first step (line 1), a LightGBM and an ANN were used to predict the peak_PS DOYs; (2) in the second step (line 2), an ensemble combining two previous GAM steps with a LightGBM and an ANN was employed to predict the Olea pollen concentration during the whole MPS. In each case, using a separate LightGBM and ANN in parallel allowed the results from both models to be compared (Fig. 2).
The models employed in Fig. 2 are briefly described in the following paragraphs. The GAMs (Hastie 2018; James et al. 2013) used as first layers in the calculation of the Olea pollen concentrations offer the possibility to be easily approximated by polynomials. They are non-parametric, so no assumptions about the normality of the data have to be made. In addition, they capture time trends well and the user can fit a different function for each feature, linear or non-linear, so they are highly flexible. GAMs thus allow a more general model to be fitted than linear models, and they have been found useful in other areas of atmospheric research (Borge et al. 2019). The functions chosen were splines since they are non-linear smoothers.
The number of degrees of freedom is a hyper-parameter of spline-based GAMs. Therefore, a grid search was done to optimize the number of degrees of freedom of the splines between 1 and 20, using the root mean squared error (RMSE) as the cross-validation error score (with 10 folds). The R package caret was used for fitting GAMs (Kuhn 2017). The resulting repeated cross-validation RMSE is shown in Fig. 3.
No optimum was reached within the tested range of up to 20 degrees of freedom, but beyond that point the model complexity becomes intractable. An intermediate value of 10 degrees of freedom was therefore set to avoid overfitting.
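The grid search with 10-fold cross-validation can be illustrated without any modelling library. The sketch below is not a GAM: it uses a simple k-nearest-neighbour smoother as a stand-in learner, with the number of neighbours playing the role of the smoothing hyper-parameter, and it runs on synthetic data. All names and values are assumptions for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_rmse(data, k, folds=10, seed=0):
    """Mean RMSE over `folds` cross-validation folds for a given k."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    scores = []
    for f in range(folds):
        test = rows[f::folds]  # every folds-th row forms the test fold
        train = [r for i, r in enumerate(rows) if i % folds != f]
        squared = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return sum(scores) / folds

def grid_search(data, grid):
    """Return the grid value with the lowest cross-validated RMSE."""
    scores = {value: cv_rmse(data, value) for value in grid}
    return min(scores, key=scores.get), scores

# Synthetic bell-shaped "season" with noise; values are illustrative only.
rng = random.Random(42)
season = [(doy, 200 * math.exp(-((doy - 150) / 12) ** 2) + rng.gauss(0, 10))
          for doy in range(100, 200)]
best_k, rmse_by_k = grid_search(season, range(1, 21))
```

The same loop structure applies to any hyper-parameter: only the stand-in `knn_predict` would be replaced by the fit and predict steps of the model being tuned.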
The LightGBM algorithm is by itself an ensembling method, since it combines several boosted trees to reach the predictions. With this technique, as with ANNs, the focus was on improving the predictions at the expense of the interpretability of the results, because both behave as black boxes.
For the ANNs, several architectures were analysed, and the one which performed best was chosen. This architecture consisted of an input layer of 20 units, a hidden layer with 50 units, and one output layer. The activation functions were, respectively, sigmoid, sigmoid, and linear, resulting from a grid search testing diverse hyper-parameters. This algorithm was described several years ago (Hopfield 1982), and its implementation using TensorFlow 2.0 (released in September 2019) constitutes a novelty in a rapidly evolving technology.
LightGBM and ANNs are state-of-the-art machine learning models and are currently among the best and most used algorithms (Kaggle 2019). Their application to the case of Olea pollen, in contrast to the traditional algorithms mentioned in the introduction, is another novelty of this work.
The data pre-processing and the calculation of the estimated variables related to the pollen season were performed using R in RStudio (RStudio Team 2015) and the specific AeRobiology R package for aerobiological tasks (Rojo et al. 2019). The LightGBM (Ke et al. 2017) was coded in Python in the Spyder IDE using the library of the same name, whereas the ANNs were implemented in Python 3.7 using TensorFlow 2.0 (Abadi et al. 2016) with the Keras API (Chollet 2015) in the cloud environment provided by Google Colaboratory running on GPU (Google
Models for the timing of pollen season in DOYs (line 1)
In this first step, only separate LightGBM and ANN models were used (line 1). They take as features all the variables shown in Fig. 2, both station and computed data.
Pollen data from the different locations were merged into one pooled dataset. Developing models for each specific place would be expected to be more accurate, but less generally applicable, so a regional approach was preferred, although the particular characteristics of each site were neglected; e.g. the wind direction may influence the pollen at a given site if olive groves were located upwind.
Once the models were fitted, they were used to assess the peak_PS. The observed peak values were also computed from the observed maximum Olea pollen concentrations. The results were visualized by means of box-and-whisker plots, and the differences between predicted and observed DOYs were assessed by the RMSE, the mean absolute error (MAE), and r² metrics. The results are shown in the "Predicting the peak dates and values (line 1)" section.
Fig. 2 Scheme of the machine learning procedure showing the two lines followed: line 1 (peak_PS prediction models) for standalone LightGBM and ANN predicting the peak_PS DOYs, and line 2 (Olea pollen concentration models) showing the ensemble for predicting the Olea pollen concentration. The r² (r²* for the final models applied alone) calculated in the test set (25% of the total data) is also shown
Fig. 3 Repeated cross-validation RMSE vs. the number of degrees of freedom for spline-based GAMs. The RMSE was calculated for each given degrees of freedom as an average of 10 folds
Models for the prediction of daily pollen concentrations (line 2)
The utility of ensemble techniques is that some models cover the weaknesses of others and reinforce their strengths. Cordero et al. (2018) used a combination of multiple linear regression and artificial neural networks (ANNs) to process the signal of air quality sensors. In this work, another approach was followed using more sophisticated, state-of-the-art algorithms.
To reach the optimal architecture, several combinations of models were tested to build an ensemble algorithm. The best combination in terms of accuracy is schematized in Fig. 2 (line 2). It combines two generalized additive models (GAMs) with LightGBM and ANNs. A third GAM did not sufficiently improve the r² and had a high computational cost, so introducing more GAM steps was discarded.
The first GAM again takes as inputs all the available variables shown in Fig. 2, with the Olea pollen concentration as the response variable. The Olea concentrations predicted by the first GAM were taken as an additional variable (metavariable) and hence as an additional input to the second GAM. Note the improvement of the r² (Fig. 2). The essence of ensembling is precisely that the second models have information about the performance of the first models given the same set of conditions (variables). Once both GAM predictions were obtained (metavariables), they were fed to the LightGBM and the ANNs along with the rest of the variables.
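The data flow of this two-step stacking can be sketched as follows. The sketch only illustrates how each prediction is appended as a metavariable for the next model; it replaces the GAMs, LightGBM, and ANN with a trivial one-column least-squares line, so it says nothing about the accuracy of the real ensemble. Whether the original work fitted the later stages on in-sample or out-of-fold predictions is not stated in the text; in-sample predictions are assumed here. All names are hypothetical.

```python
def fit_line(rows, targets, col):
    """Least-squares line on one column; a stand-in for a real learner."""
    xs = [row[col] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(targets) / len(targets)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, targets))
    slope = sxy / sxx if sxx else 0.0
    return lambda row: mean_y + slope * (row[col] - mean_x)

def fit_stack(rows, targets):
    """Chain two first-layer models and a final model via metavariables."""
    # Step 1: first model trained on the original features (column 0 here).
    first = fit_line(rows, targets, col=0)
    rows_1 = [row + [first(row)] for row in rows]
    # Step 2: the second model is given the first prediction as a new column.
    second = fit_line(rows_1, targets, col=len(rows_1[0]) - 1)
    rows_2 = [row + [second(row)] for row in rows_1]
    # Final model: the row now holds the features plus both metavariables.
    # This stand-in learner only reads the last column; a real learner
    # would use all of them.
    final = fit_line(rows_2, targets, col=len(rows_2[0]) - 1)

    def predict(row):
        row_1 = row + [first(row)]
        row_2 = row_1 + [second(row_1)]
        return final(row_2)

    return predict

# Hypothetical data: one feature and a target that depends on it linearly.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
model = fit_stack(X, y)
```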
Validation of the models
All the models were validated internally and externally. For internal validation, a two-step method was used: (i) the whole dataset was randomly split into train and test sets, accounting respectively for 75 and 25% of the data, which is common practice. The statistics RMSE, MAE, and r² were calculated on the test dataset (see Fig. 2); (ii) a 10-fold cross-validation was performed and the results were expressed with the same statistics and the corresponding standard deviation among folds. This two-step approach allowed us to verify the robustness of the models.
The external validation was performed by computing the r², RMSE, and MAE for the year 2018 at the ten locations, which involves independent data not included in the training of the model. We used data from April to July 2018 (inclusive) for external validation. In other words, we restricted the external validation to the Olea pollen season since it is the important period in this work (both observed and predicted values remain around zero outside that period). The standard deviation among locations was also reported.
Fig. 4 Average pollen concentrations of Olea pollen for the 10 locations studied. The orange shading represents the amplitude (maximum-minimum) of daily values for the time series 1997–2018. The AeRobiology R package was used (Rojo et al. 2019)
Fig. 5 Observed Olea concentration time series (black line) and predicted values using LightGBM (blue line) or ANN (red line). The data have been treated as a pool of all the observation stations using daily averages
The RMSE (Eq. 3) is useful for providing information about large errors, since the differences between the estimated and the observed values are squared before taking the average. On the other hand, the MAE (Eq. 4) averages the absolute errors directly, so its interpretation is more straightforward. The combination of both errors should give a complementary idea of the errors of the models (both in grains/m³):

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2} \tag{3}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \tag{4}$$

where N is the number of observations, y_i is the ith observed Olea concentration in grains/m³, and ŷ_i is the ith predicted Olea concentration in grains/m³.
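The two error metrics are straightforward to compute. The following minimal sketch uses the standard definitions of RMSE and MAE with made-up values; the function names are illustrative.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: large errors are penalised more heavily."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(observed, predicted):
    """Mean absolute error: every error counts in proportion to its size."""
    absolute = [abs(p - o) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical observed and predicted concentrations (grains/m3).
obs = [10, 20, 30, 40]
pred = [12, 18, 30, 50]
example_rmse = rmse(obs, pred)
example_mae = mae(obs, pred)
```

In this example a single error of 10 grains/m³ dominates the RMSE (about 5.2) while the MAE stays at 3.5, which shows why the two metrics are complementary.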
The Akaike Information Criterion (AIC) is commonly used to compare different models. It assesses the amount of information that would be lost if a given model was used, taking into account the trade-off between the goodness of fit and the penalty on the number of variables. AIC can be computed using Eq. 5 (Akaike 1974):

$$AIC = N \cdot LL + 2k \tag{5}$$

where N is the number of observations, LL is the log-likelihood for the model using the natural logarithm (e.g. the log of the MSE), and k is the number of variables. The best model will show the lowest AIC.
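Taking LL as the natural logarithm of the MSE, as the definition above suggests, the AIC reduces to a one-line computation. The sketch below assumes that reading of Eq. 5; the function name and the numbers are illustrative.

```python
import math

def aic_from_mse(mse, n_obs, n_vars):
    """AIC = N * ln(MSE) + 2k, with LL taken as the log of the MSE."""
    return n_obs * math.log(mse) + 2 * n_vars

# Two hypothetical models fitted to the same 1000 observations.
aic_simple = aic_from_mse(mse=900.0, n_obs=1000, n_vars=1)
aic_rich = aic_from_mse(mse=400.0, n_obs=1000, n_vars=15)
```

Here the richer model has the lower AIC because its reduction in MSE outweighs the penalty of 2 per additional variable.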
Finally, the feature importance was assessed by means of the LightGBM package (Keras does not currently provide functions to evaluate it) by plotting the features in descending order of number of splits. The number of splits measures how many times each variable was used in the model. The results are given in the "Predicting the daily pollen concentration (line 2)" section (Fig. 9).
Results and discussion
Figure 4 shows the time series of Olea pollen concentration measured at each study site in this work. Pollen concentrations above zero were only detected within a short period of time, coinciding with the main pollen season. This marked seasonality of pollen emission makes it difficult to develop predictive models.
Several pollen peaks can be seen in the pollen curve, but only the maximum for each year will be referred to here as the pollen peak. The different peaks are probably related to staggered flowering due to the different locations of the olive orchards or to site-related features like altitude or other topographical features (Chuine et al. 1998; Rojo and Perez-Badia 2014). In addition, the multiple peaks observed in the pollen curve could also be due to factors involved in the pollen dispersion conditions (proximity to olive groves, or urban architecture close to the traps) or sudden changes in meteorological conditions (wind conditions or other factors).
Fig. 6 Boxplots of the maximum observed Olea concentrations (peak, in grains/m³) (a) and of the day of the year (DOY) when the maximum peak occurs (b) for the ten pollen stations studied. The predictions from both models, LightGBM and ANN, are contrasted
Table 1 Model results from the internal and external validation of peak predictions (line 1)
Ensemble    Validation   r²            RMSE           MAE
LightGBM    Internal     0.98          11.7           9.5
ANN         Internal     0.96          12.2           8.3
LightGBM    External*    0.74 ± 0.09   15.6 ± 10.8    9.2 ± 5.8
ANN         External*    0.69 ± 0.05   11.4 ± 8.6     8.6 ± 4.7
*The error metrics and their standard deviations are computed across the 10 locations
Table 2 Model results from the cross and external validation for the pollen concentration (line 2)
Ensemble    Validation   r²            RMSE            MAE
LightGBM    Cross        0.78 ± 0.05   29.27 ± 2.73    9.45 ± 2.55
ANN         Cross        0.71 ± 0.09   25.03 ± 2.48    9.35 ± 2.24
LightGBM    External*    0.63 ± 0.17   39.06 ± 19      14.85 ± 6.34
ANN         External*    0.56 ± 0.12   49.14 ± 22.17   29.56 ± 16.29
*The error metrics and their standard deviations are computed across the 10 locations
Predicting the peak dates and values (line 1)
In line 1, for predicting the peak_PS, an r² of around 0.70 was obtained for the test set. Figure 5 shows the observed and predicted Olea concentration time series for both models, using all the data as a pool and daily averages. The AICs of LightGBM and ANN are similar, but slightly smaller for the former model, indicating a better fit.
It is worth noting that the LightGBM and ANN models seem to capture the timing of the peaks of the pollen season accurately, but they tend to underpredict the observed peak values. Figure 6a shows the comparison between observed and estimated maximum peak concentrations, and Fig. 6b shows the comparison between observed and estimated peak dates (peak_PS) for the pollen stations studied, using both the LightGBM and ANN statistical models.
The models perform reasonably well at predicting the peaks, both the peak pollen value and the peak date (peak_PS) (Fig. 6). The MAE obtained was within the same order of magnitude as that found by other authors (Picornell et al. 2019). Table 1 shows the results of the internal validation for the peak_PS prediction, and the corresponding values for the external validation (see the "Predicting the daily pollen concentration (line 2)" section).
Regarding the phenological peak date, the predicted mean peak date (considering all pollen stations and years) was 149.1 ± 9.3 days for LightGBM and 150.1 ± 10.8 days for ANN, very close to the observed mean peak date (148.8 ± 9.8 days).
Additionally, the critic Sf can be used by itself for predicting the start_PS. The critic Sf that minimized the MAE (as explained in the "Phenological variables and thermal requirements" section) turned out to be 3.3 units, with an MAE of 1.9 Sf units (Sf is dimensionless). The start_PS DOY obtained by using the critic Sf by itself was 138.2 ± 14.0 days, whereas the observed start_PS was 131.2 ± 10.1 days; the two values are quite similar.
Predicting the daily pollen concentration (line 2)
In this section, the ensemble shown in Fig. 2 was trained and
its results were assessed. Table 2 shows the results of the
10-fold cross-validation as mean values followed by their standard
deviation. The results are comparable to, and of the same order of
magnitude as, those of Iglesias-Otero et al. (2015) and Lara et al.
(2019). In addition, Table 2 shows the results of the external
validation as mean values followed by their standard deviation, by
location, for the year 2018, which had not been included in the
training time series. By algorithm, the r² values for LightGBM were
slightly better than those for the ANNs (Table 2), but the RMSE and
MAE were a little higher in the cross-validation, and the opposite
held in the external validation. The calculated AICs are consistent
with the r² values and with the findings of the Predicting the peak
dates and values (line 1) section.
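The way cross-validation results are reported as "mean followed by standard deviation" can be illustrated with a minimal sketch. A one-variable least-squares line stands in for the real ensemble, the data are synthetic, and r² is computed as 1 - RSS/TSS; all of these are assumptions made for the example, not the authors' implementation.

```python
import math
import random
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (a stand-in for the real ensemble)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def scores(y_true, y_pred):
    """RMSE, MAE and coefficient of determination for one fold."""
    n = len(y_true)
    y_bar = mean(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - y_bar) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(rss / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "r2": 1 - rss / tss,
    }

def k_fold_cv(x, y, k=10, seed=0):
    """Shuffle the indices, split them into k folds, and return (mean, SD) per score."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    per_fold = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [a + b * x[i] for i in test_idx]
        per_fold.append(scores(y_true, y_pred))
    return {
        m: (mean([f[m] for f in per_fold]), stdev([f[m] for f in per_fold]))
        for m in ("rmse", "mae", "r2")
    }

# Synthetic data: a noisy linear relationship, used only to exercise the code.
rng = random.Random(1)
x = [float(i) for i in range(100)]
y = [2.0 * xi + 1.0 + rng.gauss(0.0, 5.0) for xi in x]

cv = k_fold_cv(x, y, k=10)
for name, (m, s) in cv.items():
    print(f"{name}: {m:.2f} ± {s:.2f}")
```

For daily time series, random folds let neighbouring days fall on both sides of the split, which is one reason an external validation on a held-out year, as used in the study, is a useful complement.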
In addition, we compared these r² values with the one obtained
when a naïve model was applied, namely using the Olea pollen
concentration from the previous day as the predicted concentration.
The r² was 0.48 and the AIC 18,251, indicating that the parsimony
of a single-variable model does not compensate for the loss of
accuracy. The poor accuracy of such a model stems from its not
including the autoregressive component (in this study, the Olea
pollen lags) or a seasonality effect (included in this study through
the yearday variable). Note also that a large part of the variance
is induced by short-term meteorological effects.
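The naïve baseline can be sketched as follows. The daily series is hypothetical, r² is taken as 1 - RSS/TSS, and the AIC uses the common least-squares form n ln(RSS/n) + 2k with k = 1; these are assumptions, since the exact formulas used in the study are not given in this section.

```python
import math

def persistence_forecast(series):
    """Pair each day's observation with the previous day's value used as its forecast."""
    predictions = series[:-1]
    observations = series[1:]
    return observations, predictions

def r_squared(obs, pred):
    """Coefficient of determination, 1 - RSS/TSS."""
    obs_mean = sum(obs) / len(obs)
    rss = sum((o - p) ** 2 for o, p in zip(obs, pred))
    tss = sum((o - obs_mean) ** 2 for o in obs)
    return 1 - rss / tss

def aic_least_squares(obs, pred, n_params):
    """AIC for Gaussian errors, up to an additive constant: n*ln(RSS/n) + 2k."""
    n = len(obs)
    rss = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return n * math.log(rss / n) + 2 * n_params

# Hypothetical daily concentrations (grains/m3) across one short season.
daily = [0, 5, 20, 60, 120, 90, 40, 10, 2]

obs, pred = persistence_forecast(daily)
r2 = r_squared(obs, pred)
aic = aic_least_squares(obs, pred, n_params=1)  # k = 1 is an assumed choice
print(f"r2 = {r2:.3f}, AIC = {aic:.1f}")
```

A persistence forecast lags every rise and fall by one day, so its error is largest around the peak, where day-to-day changes are greatest.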
The results show that the ensembles were able to
predict the period when the MPS begins. In addition, the
peak_PS is predicted very accurately, with the exception
of locations such as alc, get, or roz, where the models seem to
predict earlier, lower peaks. However, the models have some
difficulty capturing the exact moment when the different
peaks occur at some sites, such as ciu or vil (Fig. 7). This result
may be due to tree-related variables not included in the
statistical model. Olive pollen biology is a very complex
field that lies outside the aims of this work (Chuine
et al. 1998). Other sources of variability, such as intrapopulation
phenological variability, may explain why the external
validation can be so complex (Rojo and Pérez-Badia 2015).
The regional model built on the pooled dataset makes it possible
to generate more general forecasting models and hence to
represent better the olive crops of the Community of Madrid as a
whole. Oteros et al. (2013) argued that local particularities could
improve the predictions for particular locations, although
the applicability of the model shrinks.
The models are then ready to predict Olea pollen concentrations
and the peak dates of the pollen season. They will also
constitute the basis for future studies of other pollen species.
Finally, some plots were made to help visualize the overall
results. Figure 8 shows the measured Olea concentration vs.
that calculated by means of the naïve model (a), the LightGBM
ensemble (b), and the ANN ensemble (c), whereas Fig. 9 shows the
LightGBM feature importance plot.
The most important features according to Fig. 9 are the
time series lags of the daily pollen concentrations, in agreement
with Lara et al. (2019). Next comes a block of temperature
features, including the Sf. These are followed by wind and
the predictions from both GAMs, which is consistent with the
decision to include GAMs in the proposed ensemble. The
features start_PS and peak_PS may
seem unimportant; however, they are included indirectly in
the variable yearday, which indicates that phenological data
are relevant to the model. The site features are the least
important ones, which suggests that the choice of a regional
model was appropriate.
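The two kinds of predictor highlighted above, lags of the daily concentration and the yearday variable, can be built as in this minimal sketch. The number of lags, the column names and the sample values are hypothetical; the study's actual feature set is larger and is described in its methods.

```python
from datetime import date, timedelta

def build_features(start_day, concentrations, n_lags=3):
    """Return one row per day holding the yearday, the target concentration
    and the concentrations of the previous n_lags days."""
    rows = []
    for i in range(n_lags, len(concentrations)):
        day = start_day + timedelta(days=i)
        row = {
            "date": day,
            "yearday": day.timetuple().tm_yday,  # seasonality proxy
            "target": concentrations[i],
        }
        for lag in range(1, n_lags + 1):
            row[f"lag_{lag}"] = concentrations[i - lag]  # autoregressive terms
        rows.append(row)
    return rows

# Hypothetical daily concentrations (grains/m3) starting on 1 May 2018.
series = [3, 8, 15, 40, 90, 70]
rows = build_features(date(2018, 5, 1), series, n_lags=3)

for r in rows:
    print(r["date"], r["yearday"], r["lag_1"], r["lag_2"], r["lag_3"], r["target"])
```

The first n_lags days are dropped because their lagged values are not available, so longer lag windows shorten the usable part of each season.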
Int J Biometeorol
[Fig. 7 panels, one per station: ara, bar, alco, vil, alc, ciu, get, leg, ret, roz. x-axis: May to July; y-axis: pollen grains/m3. The R2 values printed in the panels range from 0.22 to 0.76.]
The results are quite accurate in both the 10-fold
cross-validation and the internal validation. It is worth
noting that the chosen ANN architecture seems to
underpredict the peaks. Further research on this issue is
needed to improve the ANNs' performance.
Nevertheless, the models developed improve the predictions of
both peak dates and Olea pollen concentrations with respect to
simpler approaches, such as predicting from the previous day's
measurements or using the models of the ensemble alone.
In this work, the prediction of the Olea pollen season in terms
of phenological timing (peak dates) and pollen intensity (daily
pollen concentrations) was combined in a regional prediction
for the Community of Madrid (central Spain). In a first step,
LightGBM and ANNs were applied directly to predict the
peak_PS DOYs. The predictions proved very accurate
for both the peak dates and the peak values. In a second step, a
more complex ensemble involving two GAM steps prior to the
LightGBM/ANNs proved very effective at predicting the
entire time series of Olea pollen concentrations.
Predicting the peaks of the pollen season proved
challenging because of the difficulty of integrating
the measured biological and environmental characteristics
at the local scale into statistical predictive models. More
studies must be conducted to develop new predictive
variables accounting for these features, which could help improve
model performance. However, the high accuracies already
obtained make the models suitable for assessing both the
period and the concentrations of the Olea pollen season, and
they could be extended to other allergenic species in the future.
This may help health authorities to prevent allergic diseases
using predictors such as pollen concentrations for preceding
days or phenological variables, which could be estimated in
advance from meteorological factors.
Fig. 7 Time series of observed Olea concentration (black line) and the
predictions of the ensembles LightGBM (red line) and ANN (blue line).
The predicted peak_PS from the models of line 1 are displayed as vertical
Fig. 8 Prediction of Olea pollen concentration (estimated vs. observed) using the naïve model (a), LightGBM (b) and ANN (c). Axis labels: Olea, LightGBM, ANN and naïve concentrations in grains/m3; r2 = 0.81 is printed in one panel
Acknowledgements This study was carried out within the AIRTEC-CM
(urban air quality and climate change integral assessment) scientific
programme funded by the Directorate General for Research and Innovation
of the Greater Madrid Region (S2018/EMT-4329). The State
Meteorological Agency (AEMET) as well as the PALINOCAM
Network are acknowledged for providing meteorological and
palynological observations.
Funding This study was carried out within the AIRTEC-CM (urban air
quality and climate change integral assessment) scientific programme
funded by the Directorate General for Research and Innovation of the
Greater Madrid Region (S2018/EMT-4329).
Data availability The sources of data are publicly available on-line:
meteorological data from AEMET Open Data and pollen data from the
Compliance with ethical standards
Conflict of interest The authors declare that they have no competing
Ethics approval Not applicable.
Consent to participate Not applicable.
Consent for publication Not applicable.
Code availability The code for this work was custom made using free
Open Source libraries from R/Python.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker P, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X (2016) TensorFlow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp.
Aguilera F, Ruiz Valenzuela L (2012) Altitudinal fluctuations in the olive pollen emission: an approximation from the olive groves of the south-east Iberian Peninsula. Aerobiologia (Bologna) 28:403–411.
Aguilera F, Ruiz L, Fornaciari M, Romano B, Galan C, Oteros J, Ben Dhiab A, Msallem M, Orlandi F (2014) Heat accumulation period in the Mediterranean region: phenological response of the olive in different climate areas (Spain, Italy and Tunisia). Int J Biometeorol
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr 19(6):716–723
Balbus JM, Greenblatt JB, Chari R, Millstein D, Ebi KL (2014) A wedge-based approach to estimating health co-benefits of climate change mitigation activities in the United States. Clim Chang 127:199–210.
Bastl K, Kmenta M, Berger UE (2018) Defining pollen seasons: background and recommendations. Curr Allergy Asthma Rep 18:73.
Beggs PJ, Sikoparija B, Smith M (2017) Aerobiology in the International Journal of Biometeorology, 1957-2017. Int J Biometeorol 61:S51
Borge R, Requia WJ, Yague C, Jhun I, Koutrakis P (2019) Impact of weather changes on air quality and related mortality in Spain over a 25 year period [1993-2017]. Environ Int 133:105272. https://doi.
Burnett R, Chen H, Szyszkowicz M, Fann N, Hubbell B, Pope CA, Apte JS, Brauer M, Cohen A, Weichenthal S, Coggins J, Di Q, Brunekreef B, Frostad J, Lim SS, Kan H, Walker KD, Thurston GD, Hayes RB, Lim CC, Turner MC, Jerrett M, Krewski D, Gapstur SM, Diver WR, Ostro B, Goldberg D, Crouse DL, Martin RV, Peters P, Pinault L, Tjepkema M, van Donkelaar A, Villeneuve PJ, Miller AB, Yin P, Zhou M, Wang L, Janssen NAH, Marra M, Atkinson RW, Tsang H, Quoc Thach T, Cannon JB, Allen RT, Hart JE, Laden F, Cesaroni G, Forastiere F, Weinmayr G, Jaensch A, Nagel G, Concin H, Spadaro JV (2018) Global estimates of mortality associated with long-term exposure to outdoor fine particulate matter. Proc Natl Acad Sci 115:9592–9597.
Fig. 9 Feature importance plot using the package LightGBM taking the number of splits as an
[Feature labels legible in the plot include Yearday, High_PS and Id_roz.]
Chollet, F., 2015. Keras
Chuine I, Cour P, Rousseau DD (1998) Fitting models predicting dates of flowering of temperate-zone trees using simulated annealing. Plant Cell Environ 21:455–466.
Chuine I, Cour P, Rousseau DD (1999) Selecting models to predict the timing of flowering of temperate trees: implications for tree phenology modelling. Plant Cell Environ 22:1–13.
Cole-Hunter T, de Nazelle A, Donaire-Gonzalez D, Kubesch N, Carrasco-Turigas G, Matt F, Foraster M, Martinez T, Ambros A, Cirach M, Martinez D, Belmonte J, Nieuwenhuijsen M (2018) Estimated effects of air pollution and space-time-activity on cardiopulmonary outcomes in healthy adults: a repeated measures study. Environ Int 111:247–259.
Cordero JM, Borge R, Narros A (2018) Using statistical methods to carry out in field calibrations of low cost air quality sensors. Sensors Actuators B Chem 267:245–254.
Cotos-Yanez TR, Rodriguez-Rajo FJ, Jato MV (2004) Short-term prediction of Betula airborne pollen concentration in Vigo (NW Spain) using logistic additive models and partially linear models. Int J Biometeorol 48:179–185.
D'Amato G, Cecchi L, Bonini S, Nunes C, Annesi-Maesano I, Behrendt H, Liccardi G, Popov T, van Cauwenberge P (2007) Allergenic pollen and pollen allergy in Europe. Allergy 62(9):976–990.
D'Amato G, Holgate ST, Pawankar R, Ledford DK, Cecchi L, Al-Ahmad M, Al-Enezi F, Al-Muhsen S, Ansotegui I, Baena-Cagnani CE, Baker DJ, Bayram H, Bergmann KC, Boulet LP, Buters JTM, D'Amato M, Dorsano S, Douwes J, Finlay SE, Garrasi D, Gómez M, Haahtela T, Halwani R, Hassani Y, Mahboub B, Marks G, Michelozzi P, Montagni M, Nunes C, Oh JJW, Popov TA, Portnoy J, Ridolo E, Rosário N, Rottem M, Sánchez-Borges M, Sibanda E, Sienra-Monge JJ, Vitale C, Annesi-Maesano I (2015) Meteorological conditions, climate change, new emerging factors, and asthma and related allergic disorders. A statement of the World Allergy Organization. World Allergy Organ J 8:25.
Diaz J, Linares C, Tobias A (2007) Short-term effects of pollen species on hospital admissions in the city of Madrid in terms of specific causes and age. Aerobiologia (Bologna) 23:231–238.
Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe, 2008. Official Journal of the European Communities
EEA (2019a) European Environment Agency (EEA), 2018a. Air quality in Europe 2018 report. EEA Report No 12/2018. EEA Report No
EEA (2019b) European Environment Agency (EEA), 2018b. Improving Europe's air quality measures reported by countries, EEA briefing. ISSN 2467-3196
Fernandez-Rodriguez S, Duran-Barroso P, Silva-Palacios I, Tormo-Molina R, Maria Maya-Manzano J, Gonzalo-Garijo A (2016a) Quercus long-term pollen season trends in the southwest of the Iberian Peninsula. Process Saf Environ Prot 101:152–159. https://
Fernandez-Rodriguez S, Duran-Barroso P, Silva-Palacios I, Tormo-Molina R, Maria Maya-Manzano J, Gonzalo-Garijo A (2016b) Regional forecast model for the Olea pollen season in Extremadura (SW Spain). Int J Biometeorol 60:1509–1517.
Fraga H, Pinto JG, Santos JA (2019) Climate change projections for chilling and heat forcing conditions in European vineyards and olive orchards: a multi-model assessment. Clim Chang 152:179–193.
Galan I, Prieto A, Rubio M, Herrero T, Cervigon P, Luis Cantero J, Dolores Gurbindo M, Isabel Martinez M, Tobias A (2010) Association between airborne pollen and epidemic asthma in Madrid, Spain: a case-control study. Thorax 65:398–402. https://
Galan C, Smith M, Thibaudon M, Frenguelli G, Oteros J, Gehrig R, Berger U, Clot B, Brandao R, Grp EASQCW (2014) Pollen monitoring: minimum requirements and reproducibility of analysis. Aerobiologia (Bologna) 30:385–395.
Galán C, Ariatti A, Bonini M, Clot B, Crouzy B, Dahl A, Fernandez-González D, Frenguelli G, Gehrig R, Isard S, Levetin E, Li DW, Mandrioli P, Rogers CA, Thibaudon M, Sauliene I, Skjoth C, Smith M, Sofiev M (2017) Recommended terminology for aerobiological studies. Aerobiologia (Bologna) 33:293–295.
Google Collaboratory [WWW Document], 2019
Hartley HO (1961) The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares. Technometrics 3:
Hastie T (2018) Gam: generalized additive models
Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci United States Am Sci 79:2554–2558.
Iglesias-Otero MA, Astray G, Vara A, Galvez JF, Mejuto JC, Rodriguez-Rajo FJ (2015) Forecasting OLEA airborne pollen concentration by means of artificial intelligence. Fresenius Environ Bull 24:4574
James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning: with applications in R. Springer New York.
Kaggle (2019) Accessed 2019
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y (2017) LightGBM: a highly efficient gradient boosting decision tree, in: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (Eds.), Advances in Neural Information Processing Systems 30 (NIPS 2017). NEURAL 92037 USA
Kuhn M (2017) Caret: classification and regression training
Lake IR, Jones NR, Agnew M, Goodess CM, Giorgi F, Hamaoui-Laguel L, Semenov MA, Solomon F, Storkey J, Vautard R, Epstein MM (2017) Climate change and future pollen allergy in Europe. Environ Health Perspect 125:385–391.
Lara B, Rojo J, Fernández-González F, Pérez-Badia R (2019) Prediction of airborne pollen concentrations for the plane tree as a tool for evaluating allergy risk in urban green areas. Landsc Urban Plan
Lelieveld J, Evans JS, Fnais M, Giannadaki D, Pozzer A (2015) The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 525, 367+.
Open Data [WWW Document], 2019
Orlandi F, Garcia-Mozo H, Ben Dhiab A, Galan C, Msallem M, Romano B, Abichou M, Dominguez-Vilches E, Fornaciari M (2013) Climatic indices in the interpretation of the phenological phases of the olive in Mediterranean areas during its biological cycle. Clim Chang 116:263–284.
Osborne CP, Chuine I, Viner D, Woodward FI (2000) Olive phenology as a sensitive indicator of future climatic warming in the Mediterranean. Plant Cell Environ 23:701–710.
Oteros J, Garcia-Mozo H, Vazquez L, Mestre A, Dominguez-Vilches E, Galan C (2013) Modelling olive phenological response to weather and topography. Agric Ecosyst Environ 179:62–68.
PALINOCAM [WWW Document], 2019
Pawankar R, Canonica GW, Holgate ST, Lockey RF, Blaiss M (2013) World Allergy Organisation (WAO) white book on allergy: update 2013. World Allergy Organization, Milwaukee
Perez-Badia R, Rapp A, Morales C, Sardinero S, Galan C, Garcia-Mozo H (2010) Pollen spectrum and risk of pollen allergy in central Spain. AAEM 17(1):139–151
Perez-Badia R, Bouso V, Rojo J, Vaquero C, Sabariego S (2013) Dynamics and behaviour of airborne Quercus pollen in central Iberian Peninsula. Aerobiologia (Bologna) 29:419–428. https://doi.
Pfaar O, Bastl K, Berger U, Buters J, Calderon MA, Clot B, Darsow U, Demoly P, Durham SR, Galan C, Gehrig R, van Wijk RG, Jacobsen L, Klimek L, Sofiev M, Thibaudon M, Bergmann KC (2018) Defining pollen exposure times for clinical trials of allergen immunotherapy for pollen-induced rhinoconjunctivitis - an EAACI position paper. Allergologie 41:386–399.
Picornell A, Buters J, Rojo J, Traidl-Hoffmann C, Damialis A, Menzel A, Bergmann KC, Werchan M, Schmidt-Weber C, Oteros J (2019) Predicting the start, peak and end of the Betula pollen season in Bavaria, Germany. Sci Total Environ 690:1299–1309. https://doi.
Rojo J, Perez-Badia R (2014) Effects of topography and crown-exposure on olive tree phenology. Trees-Structure Funct 28:449–459. https://
Rojo J, Perez-Badia R (2015) Models for forecasting the flowering of Cornicabra olive groves. Int J Biometeorol 59:1547–1556. https://
Rojo J, Salido P, Perez-Badia R (2015) Flower and pollen production in the 'Cornicabra' olive (Olea europaea L.) cultivar and the influence of environmental factors. Trees-Structure Funct 29:1235–1245.
Rojo J, Rivero R, Romero-Morte J, Fernandez-Gonzalez F, Perez-Badia R (2017) Modeling pollen time series using seasonal-trend decomposition procedure based on LOESS smoothing. Int J Biometeorol
Rojo J, Picornell A, Oteros J (2019) AeRobiology: the computational tool for biological data in the air. Methods Ecol Evol
RStudio Team (2015) RStudio: integrated development environment for
Šantl-Temkiv T, Sikoparija B, Maki T, Carotenuto F, Amato P, Yao M, Morris CE, Schnell R, Jaenicke R, Pöhlker C, DeMott PJ, Hill TCJ, Huffman JA (2019) Bioaerosol field measurements: challenges and perspectives in outdoor studies. Aerosol Sci Technol 54:141.
Silva-Palacios I, Fernandez-Rodriguez S, Duran-Barroso P, Tormo-Molina R, Maria Maya-Manzano J, Gonzalo-Garijo A (2016) Temporal modelling and forecasting of the airborne pollen of Cupressaceae on the southwestern Iberian Peninsula. Int J Biometeorol 60:297–306.
Thien F, Beggs PJ, Csutoros D, Darvall J, Hew M, Davies JM, Bardin PG, Bannister T, Barnes S, Bellomo R, Byrne T, Casamento A, Conron M, Cross A, Crosswell A, Douglass JA, Durie M, Dyett J, Ebert E, Erbas B, French C, Gelbart B, Gillman A, Harun N-S, Huete A, Irving L, Karalapillai D, Ku D, Lachapelle P, Langton D, Lee J, Looker C, MacIsaac C, McCaffrey J, McDonald CF, McGain F, Newbigin E, O'Hehir R, Pilcher D, Prasad S, Rangamuwa K, Ruane L, Sarode V, Silver JD, Southcott AM, Subramaniam A, Suphioglu C, Susanto NH, Sutherland MF, Taori G, Taylor P, Torre P, Vetro J, Wigmore G, Young AC, Guest C (2018) The Melbourne epidemic thunderstorm asthma event 2016: an investigation of environmental triggers, effect on health services, and patient risk factors. Lancet Planet Heal 2:e255–e263. https://doi.
World Health Organization [WWW Document], 2019
Xie Y, Dai H, Xu X, Fujimori S, Hasegawa T, Yi K, Masui T, Kurata G (2018) Co-benefits of climate mitigation on air quality and human health in Asian countries. Environ Int 119:309–318.
Publisher's note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Int J Biometeorol
... Machine learning techniques integrate pollen observations, meteorological data, and algorithms to accurately predict daily pollen concentrations. The most commonly applied techniques, including Deep Neural Networks (DNN) [10,11], Random Forests [10][11][12], Light Gradient Boosting Machine (LightGBM) [13], Least Absolute Shrinkage and Selection Operator (LASSO) [10], Artificial Neural Networks (ANN) [13][14][15][16], Extreme Gradient Boosting (XGBoost) [11], a K-mean cluster analysis [15,17], and a Bayesian ridge [11] can estimate phenological metrics and pollen intensity parameters. These techniques have been utilized for predicting pollen concentrations of various species such as Ambrosia [10][11][12]17], Oleaceae [13], Quercus [12], Cupressaceae [12], and Poaceae [12,15,17], and thus constitute valuable tools for allergiological and ecological implementations. ...
... Machine learning techniques integrate pollen observations, meteorological data, and algorithms to accurately predict daily pollen concentrations. The most commonly applied techniques, including Deep Neural Networks (DNN) [10,11], Random Forests [10][11][12], Light Gradient Boosting Machine (LightGBM) [13], Least Absolute Shrinkage and Selection Operator (LASSO) [10], Artificial Neural Networks (ANN) [13][14][15][16], Extreme Gradient Boosting (XGBoost) [11], a K-mean cluster analysis [15,17], and a Bayesian ridge [11] can estimate phenological metrics and pollen intensity parameters. These techniques have been utilized for predicting pollen concentrations of various species such as Ambrosia [10][11][12]17], Oleaceae [13], Quercus [12], Cupressaceae [12], and Poaceae [12,15,17], and thus constitute valuable tools for allergiological and ecological implementations. ...
... The most commonly applied techniques, including Deep Neural Networks (DNN) [10,11], Random Forests [10][11][12], Light Gradient Boosting Machine (LightGBM) [13], Least Absolute Shrinkage and Selection Operator (LASSO) [10], Artificial Neural Networks (ANN) [13][14][15][16], Extreme Gradient Boosting (XGBoost) [11], a K-mean cluster analysis [15,17], and a Bayesian ridge [11] can estimate phenological metrics and pollen intensity parameters. These techniques have been utilized for predicting pollen concentrations of various species such as Ambrosia [10][11][12]17], Oleaceae [13], Quercus [12], Cupressaceae [12], and Poaceae [12,15,17], and thus constitute valuable tools for allergiological and ecological implementations. ...
... In the field of aerobiology, as already mentioned in the introduction, machine learning algorithms have been widely proposed since the nineties. Different studies have been interested in developing models using jointly meteorological, phenological, environmental and also historical concentration data for objectives such as forcasting the presence or absence of a pollen species in a given location, estimating the onset of the pollen season of a species (Andersen, 1991;Cassagne, 2009) or predicting the inter-annual variation of pollen seasons (Spieksma, Emberlin, Hjelmroos, Jäger and Leuschner, 1995), or predict the level of pollen risk or concentration (Cordero, Rojo, Gutiérrez-Bustillo, Narros and Borge, 2021;Castellano-Méndez, Aira, Iglesias, Jato and González-Manteiga, 2005;Sánchez-Mesa, Galán, Martínez-Heras and Hervás-Martínez, 2002;Hidalgo, Mangin, Galán, Hembise, Vázquez and Sanchez, 2002;Ranzi, Lauriola, Marletto and Zinoni, 2003;Iglesias-Otero et al., 2015;Muzalyova, Brunner, Traidl-Hoffmann and Damialis, 2021) Among others, support vector machines (SVM) have been proposed in (Zewdie, Liu, Wu, Lary and Levetin, 2019b), random forests (RF) in (Zewdie et al., 2019b;Nowosad et al., 2018), artificial neural networks (ANN) in (Cordero et Color Key and Histogram Count Figure 6: A typological analysis in space and pollen by hierarchical classification (Manhattan distance and Ward linkage after decomposition into non-negative matrix of the data) allows to establish groupings of sites and pollens. The colors of the sites are defined a priori according to their geographical positions. ...
... "Orange" sites are from north west, "red" sites from south west, "green" sites from north east and "blue" sites from south east. 2019a), regression models in (Box, Jenkins, Reinsel and Ljung, 2015) or the gradient boosting models (Cordero et al., 2021). These frequently used algorithms have led to satisfactory results which are detailed below. ...
... These frequently used algorithms have led to satisfactory results which are detailed below. In (Cordero et al., 2021), the authors are interested in the prediction of the daily concentration of the olive tree in Madrid (Spain). They implemented models based on Light Gradient Boosting Machine (LightGBM) combined with a Generalized Additive Model (GAM) and Artificial Neural Networks (ANN) to predict the day of the year when the peak of the pollen season occurs. ...
... The most relevant variables for the random forest models were the DOY, the general land-use index, and the temperatures (Fig. 6). Here, the DOY represents an interesting variable with phenological meaning; it has been the most common way to consider the intrinsic phenological patterns in forecasting models due to pollen data being a very seasonal variable (Cordero et al., 2021). Phenology represents a key factor, especially in pollen types that include a wide range of species with different pollen production, such as grasses (Brennan et al., 2019), as well as in Plantago, Urticaceae, and Amaranthaceae (Fernández-Illescas et al., 2010;González-Parrado et al., 2015). ...
Airborne pollen concentrations are influenced by wind direction, wind speed, temperature and rainfall among other meteorological variables, but they are conditioned by local land use. Combining land use and wind frequencies within a single variable would allow the estimation of the contribution of the emission sources to the airborne pollen detected. The aim of this study was to develop this new variable and to compare its relevance in estimating airborne pollen concentrations of anemophilous (i.e. wind pollinated) herbaceous taxa versus other meteorological predictors. The airborne pollen concentrations of some herbaceous pollen types were studied in Malaga, Spain (1992-2020). The land-use surfaces were combined with the daily wind direction frequencies to develop an index. This index was relevant for estimating daily airborne pollen concentrations in random forest frameworks since high pollen concentrations were detected on days with high index values. However, the relationship between the index and the pollen concentrations was not linear due to the influence of other variables. Overall, this index constitutes an easy-to-use approach to integrate both land use and wind frequencies in pollen models, and it can be applied to other sampling locations and pollen types.
... Due to the agricultural, environmental, medical and economic interests of the olive tree (Olea europaea), it has been the subject of numerous investigations based generally on aerobiological characteristics. Most recent studies deal with the development of prediction models for pollen concentrations using big data techniques (Rojo et al., 2019) and neural networks (Iglesias-Otero et al., 2015;González-Naharro et al., 2019;Fernández-Rodríguez, 2020;Muzalyova et al., 2021;Cordero et al., 2021). There are also works on the expansion of crops to new areas (Garrido et al., 2021), the genetic renewal of endangered species (Lopez-Orozco et al., 2020), and the development of strategies for biocontainment and mitigation of transgene flow between crops (Zhang et al., 2021), all of them based on the study of medium or long-range transport of pollen. ...
The province of Alicante (southeastern Spain) has a low density of olive trees; however, a substantial increase in the number of patients with allergic sensitization to Olea has been observed in recent years. The present work seeks to identify and quantify the transport of Olea pollen from other source regions, defining a new parameter called External Pollen Index (EPI). The second objective of the study was to determine Olea pollen exposure levels in Alicante city using a new method based on statistical percentiles and linear regression analysis. For these purposes, a study of daily and 2-hour average concentrations of Olea pollen combined with cluster analysis of 2-hour back-trajectories was carried out during the pollination period of 2010-2015. The annual levels of olive pollen recorded in Alicante showed a significant contribution from the Western Iberian Peninsula. Thus, the EPI is defined as the sum of the contributions of pollen transported by air masses from southwestern and northwestern Spain. The good linear correlation beetween the EPI and the APIn (R² = 0.818; p-value 0.013) suggests that the total contribution of pollen from western regions may be a good predictor of the annual concentration of Olea pollen registered in Alicante. It has been estimated that, in the absence of external contributions, the Olea pollen concentration at the study area would be around 1300 pollen grains⋅m⁻³. Olea pollen exposure levels determined in Alicante were: Low (<20 pollen grains⋅m⁻³), Moderate (20–50 pollen grains⋅m⁻³), High (51–100 pollen grains⋅m⁻³) and Very High (>100 pollen grains⋅m⁻³). High and very high exposure levels were associated with northwestern and southwestern contributions, respectively. Future research should consider external contributions (EPI) in prediction models in order to get a better estimation of pollen levels and, consequently, improve the information provided to the allergic population.
Numerous studies show that meteorological conditions have an impact on the emission, dispersion and suspension of pollens in the air. Several allergenic species permanently threaten the health of millions of people in France and that can be extrapolate that this is the case in most part of the world. Hence, preventive information on the risk of pollen exposure would become a real asset for allergy sufferers. The main objective of this article is to study, through statistical learning techniques exploiting historical data and meteorological parameters of the day (T), the ability to predict three-day ahead (T+3) pollens presence risk levels in the air on a given territory (in metropolitan France). We are interested in the prediction of risk, discretized in four levels for three families of pollens which are among the most allergenic species (ragweed, cupressaceae and grasses). Combining binary logistic regression models for each risk level using a set of ranking rules or a random forest classifier is proposed in this study. The pollen risk level prediction performances reach 70% to more than 90% of auc, precision and recall on the majority of 68 considered sites and especially with a similar prediction capacity on sites with no previous pollen data. The comparative study with some more classical models of the literature shows that the proposed model have a slight performance advantage.
Full-text available
Information on the allergenic pollen season provides insight on the state of the environment of a region and facilitates allergy symptom management. We present a retrospective analysis of the duration and severity of the allergenic pollen season and the role of meteorological factors in Istanbul, Turkey. Aerobiological sampling from January 2013 to June 2016, pollen identification and counting followed current standard methodology. Pollen seasons were defined according to 95% of the Annual Pollen Integral (APIn) and the season start date was compared with the first day of 5 day consecutive non-zero records. Generalized additive models (GAMs) were created to study the effect of meteorological factors on flowering. The main pollen contributors were taxa of temperate and Mediterranean climates, and neophytic Ambrosia . Cupressaceae, Poaceae, Pinaceae, Quercus and Ambrosia had the greatest relative abundance. The pollen season defined on 95% of the APIn was adequate for our location with total APIns around 10.000 pollen*day*m ⁻³ . Woody taxa had generally shorter seasons than herbaceous taxa. In trees, we see precipitation as the main limiting factor for assimilate production prior to anthesis. A severe tree pollen season in 2016 suggests intense synchronous flowering across taxa and populations triggered by favourable water supply in the preceding year. GAM models can explain the effect of weather on pollen concentrations during anthesis. Under the climatic conditions over the study period, temperature had a negative effect on spring flowering trees, and a positive one on summer flowering weeds. Humidity, atmospheric pressure and precipitation had a negative effect on weeds. Our findings contribute to environmental and allergological knowledge in southern Europe and Turkey with relevancy in the assessment of impacts of climate change and the management of allergic disease.
Artificial and augmented intelligence (AI) and machine learning (ML) methods are expanding into the health care space. Big data are increasingly used in patient care applications, diagnostics, and treatment decisions in allergy and immunology. How these technologies will be evaluated, approved, and assessed for their impact is an important consideration for researchers and practitioners alike. With the potential of ML, deep learning, natural language processing, and other assistive methods to redefine health care usage, a scaffold for the impact of AI technology on research and patient care in allergy and immunology is needed. An American Academy of Asthma Allergy and Immunology Health Information Technology and Education subcommittee workgroup was convened to perform a scoping review of AI within health care as well as the specialty of allergy and immunology to address impacts on allergy and immunology practice and research as well as potential challenges including education, AI governance, ethical and equity considerations, and potential opportunities for the specialty. There are numerous potential clinical applications of AI in allergy and immunology that range from disease diagnosis to multidimensional data reduction in electronic health records or immunologic datasets. For appropriate application and interpretation of AI, specialists should be involved in the design, validation, and implementation of AI in allergy and immunology. Challenges include incorporation of data science and bioinformatics into training of future allergists-immunologists.
Outdoor field measurements of bioaerosols are performed within a wide range of basic and applied scientific disciplines, each with its own goals, assumptions, and terminology. This paper contains brief reviews of outdoor field bioaerosol research from these diverse interests, with emphasis on perspectives from the atmospheric sciences. The focus is on a high-level discussion of pressing scientific questions, grand challenges, and needs for cross-disciplinary collaboration. The research topics in which bioaerosol field measurements are important include (i) atmospheric physics, clouds, climate, and hydrological cycle; (ii) atmospheric chemistry; (iii) airborne allergen-containing particles; (iv) airborne human pathogens and national security; (v) airborne livestock and crop pathogens; and (vi) biogeography and biodiversity. We concisely review bioaerosol impacts and discuss properties that distinguish bioaerosols from abiological aerosols. We give extra focus to regions of specific interest, i.e. forests, polar regions, marine and coastal environments, deserts, urban and rural areas, and summarize key considerations related to bioaerosol measurements, such as fluxes, long-range transport, and measurements from both stationary and vessel-driven platforms. Keeping in mind a series of key scientific questions posed within the diverse communities, we suggest that pressing scientific questions include: (i) emission sources and flux estimates; (ii) spatial distribution; (iii) changes in distribution; (iv) atmospheric aging; (v) metabolic activity; (vi) urbanization of allergies; (vii) transport of human pathogens; and (viii) climate-relevant properties.
Aerobiological databases are constantly increasing. Many of them contain long and extensive time series of data which are very difficult and tedious to manage. The development of new real-time automatic sampling devices also requires new tools to reduce the time spent on calculations and data management. To this end, the AeRobiology R package has been implemented to accelerate and facilitate these tasks. The package is structured in three sections covering (a) the checking of the database, (b) the calculation of the main aerobiological indexes and (c) the visualization of the results. The AeRobiology package contains numerous functions which, in conjunction, solve the main general tasks that scientists face in the analysis of the biological data. The package is freely distributed under the GNU General Public License and can be installed directly from CRAN. A reference manual is also available.
Air temperatures play a major role in temperate fruit development, and the projected future warming may thereby bring additional threats. The present study aims at analyzing the impacts of climate change on chilling and heat forcing in European vineyards and olive (V&O) orchards. Chilling portions (CP) and growing degree hours (GDH) were computed yearly for the recent past (1989–2005) and the RCP4.5 and RCP8.5 future scenarios (2021–2080), using several regional-global climate models, also considering model uncertainties and biases. Additionally, minimum CP and GDH values found in 90% of all years were also computed. These metrics were then extracted for the current locations of V&O in Europe, and CP-GDH delimitations were assessed. For the recent past, high CP values are found in north-central European regions, while lower values tend to occur on opposite sides of Europe. Regarding forcing, southern European regions currently show the highest GDH values. Future projections point to increased warming, particularly under RCP8.5 and from 2041 onwards. A lower/higher CP is projected for south-western/eastern Europe, while most of Europe is projected to have higher GDH. Northern-central European V&O orchards should still have future CP-GDH similar to present values, while most southern European orchards are expected to have much lower CP and higher GDH, especially under RCP8.5. These changes may bring limitations to some of the world's most important V&O producers, such as Spain, Italy and Portugal. The planning of suitable adaptation measures against these threats is critical for the future sustainability of the European V&O sectors.
Purpose of Review: The definition of a pollen season determines the start and the end of the time period with a certain amount of pollen in the ambient air. Different pollen season definitions were used for a long time, including the use of different terms for data and methods used to define a pollen season. Recently suggested pollen season definitions for clinical trials were tested and applied for the first time to more aeroallergens. Recent Findings: This is a review on pollen season definitions and the latest recommendations. Recently proposed terminology in aerobiology is promoted here in order to support reproducibility and repeatability in research. Two pollen season definitions, one based on percentages and one based on pollen concentrations, were tested. Summary: Percentage definitions can be recommended for standard aerobiological routines and for retrospective applications, whereas pollen concentration definitions can be recommended for prospective applications such as clinical trials.
Exposure to ambient fine particulate matter (PM2.5) is a major global health concern. Quantitative estimates of attributable mortality are based on disease-specific hazard ratio models that incorporate risk information from multiple PM2.5 sources (outdoor and indoor air pollution from use of solid fuels and secondhand and active smoking), requiring assumptions about equivalent exposure and toxicity. We relax these contentious assumptions by constructing a PM2.5-mortality hazard ratio function based only on cohort studies of outdoor air pollution that covers the global exposure range. We modeled the shape of the association between PM2.5 and nonaccidental mortality using data from 41 cohorts from 16 countries: the Global Exposure Mortality Model (GEMM). We then constructed GEMMs for five specific causes of death examined by the global burden of disease (GBD). The GEMM predicts 8.9 million [95% confidence interval (CI): 7.5-10.3] deaths in 2015, a figure 30% larger than that predicted by the sum of deaths among the five specific causes (6.9; 95% CI: 4.9-8.5) and 120% larger than the risk function used in the GBD (4.0; 95% CI: 3.3-4.8). Differences between the GEMM and GBD risk functions are larger for a 20% reduction in concentrations, with the GEMM predicting 220% higher excess deaths. These results suggest that PM2.5 exposure may be related to additional causes of death than the five considered by the GBD and that incorporation of risk information from other, nonoutdoor, particle sources leads to underestimation of disease burden, especially at higher concentrations.
Climate change is a major public health concern. In addition to its direct impacts on temperature patterns and extreme weather events, climate change affects public health indirectly through its influence on air quality. Pollution trends are not only affected by emissions changes but also by weather changes. In this paper we analyze air quality trends in Spain for important air pollutants (C6H6, CO, NO2, NOx, O3, PM10, PM2.5, and SO2) recorded during the last 25 years, from 1993 to 2017. We found substantial reductions in ambient concentration levels for all the pollutants studied except for O3. To assess the influence of recent weather changes on air quality trends we applied generalized additive models (GAMs) using nonparametric smoothing, with and without adjusting for weather parameters including temperature, wind speed, humidity and precipitation frequency. The difference of annual slopes estimated by the models without and with adjusting for these meteorological variables represents the impact of weather changes on pollutant trends, i.e. the 'weather penalty'. The analyses were seasonally and geographically stratified to account for temporal and regional differences across Spain. The results were meta-analyzed to estimate weather penalties on ambient concentration trends at a national level as well as the impact on mortality for the most relevant pollutants. We found significant penalties for most pollutants, implying that air quality would have improved even more during our study period if weather conditions had remained constant. The largest weather influences were found for PM10, with seasonal penalties up to 22 μg·m⁻³ accumulated over the 25-year period in some regions. The national meta-analysis shows penalties of 0.060 μg·m⁻³ per year (95% Confidence Interval, CI: 0.004, 0.116) in cold months and 0.127 μg·m⁻³ per year (95% CI: 0.089, 0.164) in warm months. Penalties of this magnitude would correspond to 129 annual deaths (95% CI: 25, 233), i.e. approximately 3200 deaths over the 25-year period in Spain. According to our results, the health benefits of recent emission abatements for this pollutant in Spain would have been up to 10% greater if weather conditions had remained constant during the last 25 years.
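The 'weather penalty' defined in this abstract, the annual slope without weather adjustment minus the slope with it, can be sketched in a few lines of Python. The sketch below is only an illustration under simplifying assumptions: it swaps the GAMs with nonparametric smoothing for ordinary least squares, uses a single synthetic weather covariate, and all names and numbers are hypothetical.

```python
import numpy as np

def annual_slope(years, conc, covariates=None):
    """Least-squares slope of concentration against year, optionally
    adjusted for weather covariates (a linear stand-in for the GAMs)."""
    years = np.asarray(years, dtype=float)
    columns = [np.ones_like(years), years]   # intercept and year term
    if covariates is not None:
        columns.extend(np.asarray(c, dtype=float) for c in covariates)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, np.asarray(conc, dtype=float), rcond=None)
    return coef[1]                           # coefficient of the year term

def weather_penalty(years, conc, covariates):
    """Slope without adjustment minus slope with adjustment."""
    return annual_slope(years, conc) - annual_slope(years, conc, covariates)

# Synthetic, noise-free series: emissions lower the pollutant by 0.5 units
# per year, while a slowly warming covariate pushes it upwards.
years = np.arange(25)
temp = 0.1 * years + np.sin(years)
conc = 30.0 - 0.5 * years + 2.0 * temp

penalty = weather_penalty(years, conc, [temp])
print(f"adjusted slope: {annual_slope(years, conc, [temp]):.3f}")
print(f"weather penalty per year: {penalty:.3f}")
```

In this toy series the adjusted slope recovers the emission-driven trend, and the positive penalty is the part of the improvement that the changing weather cancelled out.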
The different species of the genus Platanus, commonly known as plane trees, have been widely grown as ornamental species in Mediterranean cities in recent years. The pollen of these species is a major source of allergens. However, surprisingly little published research has addressed methods for predicting the allergy risk prompted by this pollen. In this work, we developed models for predicting airborne Platanus pollen concentrations, constructed using data from central Spain. Predictions are very useful to alert citizens and give allergy patients advance warning of expected high airborne pollen concentrations. The prediction models indicate that airborne Platanus pollen concentrations can be forecast up to seven days (one week) in advance, using a method that combines the analysis of long-term aerobiological data (in this case, over the eleven-year study period), in order to detect seasonal trends within time series, with the modelling of short-term fluctuations in airborne pollen concentrations prompted by daily changes in meteorological variables and pollen concentrations over the previous days. The meteorological variables studied were maximum and minimum temperature, rainfall and relative humidity. The validation of the prediction models yielded a coefficient of correlation between observed and predicted values of R = 0.7, indicating that these models predict most of the pollen peaks in the airborne pollen curve.
Betula pollen is frequently found in the atmosphere of central and northern Europe. Betula pollen is relevant to health as it causes severe allergic reactions in the population. We developed models of thermal requirements to predict start, peak and end dates of the Betula main pollen season for Bavaria (Germany). Betula pollen data of one season from 19 locations were used to train the models. Estimated dates were compared with observed dates, and the errors were spatially represented. External validation was carried out with time series datasets of 3 different locations (36 years in total). Results: The temperature requirements to trigger the main pollen season proved non-linear. For the start date model (error of 8.75 days during external validation), daily mean temperatures above a threshold of 10 °C from 28 February onwards were the most relevant. The peak model (error of 3.58 days) takes into account mean daily temperatures accumulated since the first date of the main pollen season on which the daily average temperature exceeded 11 °C. The end model (error of 3.75 days) takes into account all temperatures accumulated since the start of the main pollen season. Conclusion: These models provide predictions that enable the allergic population to better manage their disease. With the established relationship between temperatures and pollen season dates, changes in the phenological behaviour of Betula species due to climate change can also be estimated in future studies by taking into account the different climate scenarios proposed by previous climate change studies.
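The idea of a thermal requirement model can be sketched as a simple temperature accumulation. The Python example below is a hedged illustration, not the model of the cited study: it assumes that only the degrees above the base temperature accumulate and that the season date is the first day on which a required sum is reached. The function name, the required sum and the temperature series are hypothetical.

```python
def day_reaching_thermal_sum(daily_mean_temp, start_index, base_temp, required_sum):
    """Return the index of the first day on which the accumulated excess of
    daily mean temperature over `base_temp`, counted from `start_index`,
    reaches `required_sum`; return None if it is never reached."""
    total = 0.0
    for i in range(start_index, len(daily_mean_temp)):
        # Days at or below the base temperature contribute nothing.
        total += max(daily_mean_temp[i] - base_temp, 0.0)
        if total >= required_sum:
            return i
    return None

# Hypothetical daily mean temperatures in degrees Celsius; index 1 stands in
# for the start of accumulation (28 February in the abstract).
temps = [8, 12, 9, 11, 14, 15, 10, 16]
predicted = day_reaching_thermal_sum(temps, start_index=1, base_temp=10.0,
                                     required_sum=10.0)
print(predicted)  # index of the predicted date within the series
```

In practice the required sum would be fitted to observed season dates, and the prediction error would be checked on independent years, as the external validation in the abstract does.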
Background: Clinical efficacy of pollen allergen immunotherapy (AIT) has been broadly documented in randomized controlled trials. The underlying clinical endpoints are analysed in seasonal time periods predefined based on the background pollen concentration. However, any validated or generally accepted definition from academia or regulatory authorities for this relevant pollen exposure intensity or period of time (season) is currently not available. Therefore, this Task Force initiative of the European Academy of Allergy and Clinical Immunology (EAACI) aimed to propose definitions based on expert consensus. Methods: A Task Force of the Immunotherapy and Aerobiology and Pollution Interest Groups of the EAACI reviewed the literature on pollen exposure in the context of defining relevant time intervals for evaluation of efficacy in AIT trials. Underlying principles in measuring pollen exposure and associated methodological problems and limitations were considered to achieve a consensus. Results: The Task Force achieved a comprehensive position in defining pollen exposure times for different pollen types. Definitions are presented for “pollen season”, “high pollen season” (or “peak pollen period”) and “high pollen days”. Conclusion: This EAACI position paper provides definitions of pollen exposures for different pollen types for use in AIT trials. Their validity as standards remains to be tested in future studies.
This book describes an array of power tools for data analysis that are based on nonparametric regression and smoothing techniques. These methods relax the linear assumption of many standard models and allow analysts to uncover structure in the data that might otherwise have been missed. While McCullagh and Nelder's Generalized Linear Models shows how to extend the usual linear methodology to cover analysis of a range of data types, Generalized Additive Models enhances this methodology even further by incorporating the flexibility of nonparametric regression. Clear prose, exercises in each chapter, and case studies enhance this popular text.