Introducing Gradient Boosting as a universal gap filling tool for meteorological time series

Uncorrected Proof
Meteorologische Zeitschrift, PrePub Article, DOI 10.1127/metz/2018/0908
© 2018 The authors. Gebrüder Borntraeger Science Publishers, Stuttgart, www.borntraeger-cramer.com
Philipp Körner, Rico Kronenberg, Sandra Genzel and Christian Bernhofer
Technische Universität Dresden
(Manuscript received February 8, 2018; in revised form August 14, 2018; accepted August 14, 2018)
Abstract
In this article, Gradient Boosting (gb) is introduced as an easily adaptable machine learning method for filling gaps caused by missing or erroneous data in meteorological time series. The gb routine is applied to a large data set of hourly station-based observations of air temperature, wind speed and relative humidity in Germany, covering the period from 1951 to 2015. Our analysis shows that Gradient Boosting estimates missing values with small errors: the median RMSE is 0.73 °C for temperature, 0.82 m/s for wind speed and 4.3 % for relative humidity. Compared with other gap filling techniques such as neural networks or multiple linear regression, Gradient Boosting yields considerably better statistics. It outperforms the other techniques particularly in calculation time, gap filling performance and the handling of missing predictor values.
Keywords: gap filling, missing values, hourly resolution, gradient boosting, xgboost, air temperature, wind
speed, relative humidity, Germany, xgb, imputation
1 Introduction
Gap filling of meteorological time series at individual sites is necessary for a wide range of tasks that require continuous data series, such as time series analysis, hydrological, meteorological or climatological modelling and working with fluxes. Therefore, a broad selection of methods for dealing with missing data is known.
Linear interpolation from adjacent time steps is reasonable when only a few data points are missing (Reichstein et al., 2005). Autoregressive models were found suitable for longer periods of missing values without adjacent sites as predictors (Giller, 2012; Shahin et al., 2014).
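As a point of reference for the simplest of these methods, the following minimal sketch fills short gaps by linear interpolation between the neighboring valid time steps. The function name `fill_short_gaps` and the limit `max_gap=3` are illustrative assumptions; the text does not state how many missing points count as "a few".

```python
from typing import List, Optional


def fill_short_gaps(values: List[Optional[float]], max_gap: int = 3) -> List[Optional[float]]:
    """Linearly interpolate runs of None that are at most max_gap long.

    max_gap is a hypothetical limit chosen for illustration only.
    Gaps at the start or end of the series have no neighbor on one side
    and are left unfilled, as are gaps longer than max_gap.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this gap
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first valid index after the gap (or n)
        gap_len = end - start
        if start == 0 or end == n or gap_len > max_gap:
            continue                   # cannot or should not interpolate
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap_len + 1)
        for k in range(gap_len):
            filled[start + k] = left + step * (k + 1)
    return filled
```

Longer gaps are deliberately left untouched here, since they are the case for which the regression-type methods discussed next are intended.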
When adjacent sites are available, spatial interpolation methods such as inverse distance weighting (Xia et al., 2001), Kriging (Stein, 2012) and thin plate spline interpolation (Hijmans et al., 2005) are commonly used. Alternatively, data-driven methods such as nearest neighbor (Junninen et al., 2004), linear or multiple linear regression (Junninen et al., 2004; Kotsiantis et al., 2006; Ramos-Calzado et al., 2008), look-up tables (Reichstein et al., 2005) or artificial neural networks (Falge et al., 2001; Reichstein et al., 2005) were found to be successful for filling gaps in time series.
In this paper, “Gradient Boosting” (gb), a method commonly used in the field of data mining (Lawrence et al., 2004; Li et al., 2007), is introduced as an alternative to the aforementioned approaches. For this purpose, a large data set of wind speed, air temperature and relative humidity in hourly resolution for up to 588 sites in Germany was created. These data served as a test case for filling gaps of missing values in time series. The objective of this paper is to demonstrate that the Gradient Boosting method can be applied to fill gaps in time series and reaches a high level of performance.
Corresponding author: Philipp Körner, Technische Universität Dresden, Pienner Straße 23, 01737 Tharandt, philipp.koerner@tu-dresden.de
2 Material and methods
2.1 Case study region
The investigated sites cover the whole territory of Germany, from the coasts of the Baltic Sea and the North Sea (0 m a.s.l.) and the lowlands in the north to the Alps in the south, with the highest mountain, “Zugspitze”, at an elevation of 2962 m a.s.l. The climate is characterized as a transition from the oceanic climate of Western Europe to the continental climate of Eastern Europe. Mean annual temperature ranges from −4.8 °C at Zugspitze to 11 °C in the Upper Rhine Plain for the period 1981–2010.
2.2 Meteorological data
We analyze hourly data of relative humidity (rH), temperature (T) and wind speed (u). The period of investigation is 1951 to 2015. The data are provided by the German Weather Service (Deutscher Wetterdienst, DWD) via a public FTP server. We downloaded the data in March 2016.
Altogether, we collected 588 time series of observed relative humidity and temperature and 397 time series of observed wind speed data. Since several sites were closed down, the maximum number of sites available in any one year is smaller than the total number of sites available over time. We analyzed only hourly time series with at least 20,000 observations. Fig. 1 shows the temporal availability of temperature, relative humidity and wind speed sites, respectively.
Figure 1: Temporal availability of sites for the parameters temperature, relative humidity and wind speed.
The data provided by the DWD were quality checked by the DWD (Behrendt, 1992, quoted in Isotta et al., 2014), but we still found suspicious values in the data. Our suspicion was raised by temporal drifts, erroneous values or trends in the series. Since the data and methods presented are freely available, we did not correct or exclude erroneous values, in order to keep the results of our study reproducible.
2.3 Gradient Boosting
Gradient Boosting was introduced by Friedman (1999) and Friedman et al. (2000). We refer to both papers for a detailed mathematical description of the approach. In short, the approach builds an ensemble of regression trees whose output depends on one or more predictors. A single tree model makes a single value prediction but has a low performance. The advantage of the approach is that it uses not only one, but many of these trees. The number of regression trees depends on the investigated data and the number of leaves of a single tree. Owing to the variability within the data set and its size, the approach establishes hundreds to thousands of trees. The result is calculated as the sum of the outputs of all the regression trees the model is built of (see Fig. 2).
Gradient Boosting is classified as a machine learning method, which is why the approach needs a learning phase with a training data set. In this training period, either a predefined number of trees or the ‘best’ number of trees is built, which is decided by cross validation at runtime. The orange boxes (Fig. 2) with the probabilities are called leaves; in addition, every prediction (blue boxes) and every leaf is a node.
Figure 2: gb regression tree ensemble; the predicted temperature value is the sum of the predicted values of the respective leaves (modified after Chen and Guestrin, 2016).
Fig. 2 gives an example of how the results are calculated. The goal of the gb approach is to predict the temperature value at an arbitrary site X. Let us assume we have two predictors, temperatures T1 and T2 measured at two neighboring sites. The model is trained on several time steps, which optimizes the thresholds (circles in Fig. 2) and predictions (rectangles in Fig. 2) for each tree (Friedman, 1999; Chen and Guestrin, 2016) with a greedy algorithm. The target is a minimum RMSE. After the training period, the model runs through all the regression trees and sums up the predictions of the respective trees to predict the value for a certain time step. In Case 1, Tree 1 results in 4 °C and Tree 2 adds −1 °C, which sums up to a temperature of 3 °C at site X. In Case 2, we obtain a temperature value of 1 °C for site X. However, the tree architecture is not limited to our example. One tree may contain several leaves, typically 2 to a few hundred; the nodes in one tree may refer to different predictors, and a predictor can occur multiple times in different trees.
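The additive prediction described above can be sketched in a few lines. The two trees below are hypothetical stand-ins for the trees in Fig. 2: their thresholds and leaf values are assumptions chosen only so that the two cases sum to 3 °C and 1 °C as in the example, and they are not the values of the fitted model.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Tree = Callable[[Features], float]


def tree_1(x: Features) -> float:
    """Hypothetical first tree: one split on predictor T1."""
    return 4.0 if x["T1"] >= 3.5 else 1.5


def tree_2(x: Features) -> float:
    """Hypothetical second tree: a correction term with one split on T2."""
    return -1.0 if x["T2"] >= 2.0 else -0.5


def predict_temperature(trees: List[Tree], x: Features) -> float:
    """The ensemble prediction is the sum of the leaf values of all trees."""
    return sum(tree(x) for tree in trees)


TREES = [tree_1, tree_2]
case_1 = {"T1": 5.0, "T2": 3.0}   # assumed predictor values for Case 1
case_2 = {"T1": 1.0, "T2": 0.5}   # assumed predictor values for Case 2
```

With these assumed inputs, `predict_temperature(TREES, case_1)` adds 4 °C and −1 °C, and `predict_temperature(TREES, case_2)` adds 1.5 °C and −0.5 °C. A fitted model differs only in scale: it holds hundreds to thousands of such trees, each with learned thresholds and leaf values.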
The example aside, for every site and each parameter, a single model is built with the R package “xgboost” (xgb) (Chen and Guestrin, 2016). The additional “x” stands for “extreme” and refers to a faster optimization algorithm. The data set is split into two parts, a training and a validation period. In the training period, the regression trees are built so as to minimize the RMSE between model prediction and measured values. Afterwards, the model calculates values for the validation period, which are used to calculate the statistical measures. For each parameter of a respective station, the validation time steps are different and independent from each other. The time series of the predictors are selected as follows, using the temperature at the arbitrary site “X” as an example:
• All time series of temperature are selected, except the time series of site “X”.
• All time series with fewer than 8760 overlapping hourly time steps (approx. one year) with site “X” in the calibration period are excluded.
• All time series with fewer than 8760 overlapping hourly time steps (approx. one year) with site “X” in the validation period are excluded.
• The 100 sites that are closest to site “X” by Euclidean distance and fulfill the previous requirements are selected as predictors.
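The selection steps above can be sketched as follows. This is a minimal illustration, not the authors' R routine: the function name `select_predictors`, the representation of each time series as a set of observed hour indices, and the use of planar site coordinates for the Euclidean distance are all assumptions made for the example.

```python
import math
from typing import Dict, List, Set, Tuple

MIN_OVERLAP = 8760      # hourly time steps, approx. one year (from the text)
MAX_PREDICTORS = 100    # number of closest sites kept (from the text)


def select_predictors(
    target: str,
    coords: Dict[str, Tuple[float, float]],   # assumed planar coordinates per site
    observed: Dict[str, Set[int]],            # assumed: hour indices with valid data
    calibration: Set[int],                    # hour indices of the calibration period
    validation: Set[int],                     # hour indices of the validation period
    min_overlap: int = MIN_OVERLAP,
    max_predictors: int = MAX_PREDICTORS,
) -> List[str]:
    """Return the predictor sites for `target`, closest first."""
    target_hours = observed[target]
    candidates = []
    for site, hours in observed.items():
        if site == target:
            continue                                   # step 1: exclude site X itself
        common = hours & target_hours
        if len(common & calibration) < min_overlap:
            continue                                   # step 2: calibration overlap
        if len(common & validation) < min_overlap:
            continue                                   # step 3: validation overlap
        candidates.append((math.dist(coords[site], coords[target]), site))
    candidates.sort()                                  # step 4: nearest sites first
    return [site for _, site in candidates[:max_predictors]]
```

Passing smaller values for `min_overlap` and `max_predictors` makes the function easy to try on a toy data set before applying the thresholds quoted in the text.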
The aforementioned steps are performed for every single site and meteorological parameter, respectively. The following error corrections were necessary: in all cases, generated negative values for wind speed and for relative humidity are set to zero, and values above 100 % humidity are set to 100 %.
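These corrections amount to clipping the model output to the physically possible range. A minimal sketch, in which the function name and the parameter labels are illustrative assumptions:

```python
def correct_prediction(value: float, parameter: str) -> float:
    """Clip a generated value to its physical range.

    The parameter labels "wind_speed" and "relative_humidity" are
    hypothetical names used only for this illustration.
    """
    if parameter == "wind_speed":
        return max(value, 0.0)                 # no negative wind speed
    if parameter == "relative_humidity":
        return min(max(value, 0.0), 100.0)     # keep within 0 to 100 %
    return value                               # temperature is left unchanged
```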
Our results display only the validation period (Table 4 to Table 5).
2.4 Multiple linear regression
Two other commonly used methods, multiple linear regression and neural networks, are used for comparison with the gb results. Due to calculation time constraints, this comparison was done with a reduced data set (Section 4).
The best possible multiple linear regression model (mlr) with the other sites for a considered time step is applied to generate replacement values for missing values. The procedure is performed using the R software (R Core Team, 2015). The R package leaps (Lumley, 2017) offers an efficient branch-and-bound algorithm (Land and Doig, 1960) to perform an exhaustive search for the best subset. The regression model with the highest adjusted R² is selected and subsequently computed with all available pairs of variates and applied for the gap filling.
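For reference, the adjusted R² that serves as the selection criterion is commonly defined as follows, where n is the number of observations and p the number of predictors in the candidate model (both symbols are notation introduced here, not taken from the text above):

$$R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}$$

Unlike the plain R², this quantity can decrease when a predictor that adds little explanatory power is included, which is what makes it usable for comparing subsets of different size.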
2.5 Neural network
The applied neural networks (nn) are designed as feed-forward neural networks (Dibike and Coulibaly, 2006). The hidden layer was set to five neurons with sigmoid activation functions. The output neurons have a linear activation function. The weights are adjusted by the Levenberg-Marquardt algorithm during training (More, 1977). The target is a minimum RMSE. The input consists of all the available station data of the reduced data set (see Section 4), excluding the station that is filled by the approach. The nn output is validated against the observed values of the respective station filled by the network. Because of the random initialization, the neural network routine is applied ten times for each site; the net with the minimum RMSE for a site is selected for the comparison.
2.6 Statistical evaluation
The following measures of performance are used to evaluate the results: root mean square error (RMSE), mean absolute error (MAE) and squared coefficient of correlation (R²). These measures are defined by:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{n}} \qquad (2.1)$$
with $x_i$ as the measured value at time $i$ and $y_i$ as the predicted value at time $i$, for $n$ predictions.
MAE is defined as
$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |x_i - y_i|}{n} \qquad (2.2)$$
R² equals the square of the Pearson correlation coefficient.
The lower the RMSE and MAE and the higher the R² value, the closer the results are to the measured values.
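The three measures can be computed directly from paired lists of measured and predicted values. The sketch below follows the definitions of RMSE, MAE and the squared Pearson correlation given above; the function names are chosen for illustration.

```python
import math
from typing import Sequence


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error between measured x and predicted y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x))


def mae(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute error between measured x and predicted y."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Square of the Pearson correlation coefficient of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov ** 2 / (var_x * var_y)
```

Note that R² defined this way measures linear association only: a prediction with a constant offset from the measurements still reaches R² = 1, which is why it is reported together with RMSE and MAE.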
3 Results and discussion
The presented hourly gap filling approach was applied to observed hourly temperature, relative humidity and wind speed from the available DWD data set from 1951 to 2015. The statistical measures RMSE, MAE and R² are given for the minimum, 25th percentile, median, 75th percentile and maximum, respectively. All of our results are based on the 50 % validation data, which were excluded from the training.
3.1 Temperature
Table 1 shows the analysis based on the quantiles of the resulting measures. Besides the minimum and maximum values, the 25th quantile, median and 75th quantile (Q25, Q50, and Q75, respectively) are shown.
Table 1: Measures of performance for modeling hourly values for air temperature.
          RMSE [°C]   MAE [°C]   R² [–]
Minimum   0.43        0.31       0.943
Q25       0.66        0.47       0.989
Q50       0.73        0.52       0.991
Q75       0.82        0.59       0.993
Maximum   1.7         1.24       0.997
The measurement accuracy for temperature at the sites ranges from 0.15 to 0.5 °C (DIN IEC 751, 1990). The median MAE of 0.52 °C achieved by gb is slightly outside this range. Combined with the high R² values, Gradient Boosting nevertheless shows a strong performance.
Fig. 3a) shows the scatter plot for the site “Hahn”. As can be seen, there are values of 20 °C, 0 °C and 30 °C, respectively, in the site data that are not outside the climatic range, but still erroneous. For instance, a sudden jump from 10 °C to 20 °C and back to 10 °C occurs. This causes a relatively high RMSE of 0.84 °C. Nevertheless, a high correlation (R² = 0.988) was achieved. Fig. 3b) shows validation results for the site “Münster” as a representative sample of the data set without erroneous measurements.
Figure 3: Model validation results for hourly air temperature compared with the measured values for a) site “Hahn” (DWD ID: 5871, RMSE = 0.84 °C, R² = 0.988) and b) site “Münster” (DWD ID: 3404) as an example for median skill in RMSE (RMSE = 0.74 °C, R² = 0.990).
3.2 Relative humidity
The statistics of the measures of performance for the gb results of relative humidity are presented as minimum, maximum, Q25, Q50 and Q75, respectively, in Table 2.
Table 2: Measures of performance for modeling hourly values of relative humidity.
          RMSE [%]   MAE [%]   R² [–]
Minimum   2.7        1.9       0.53
Q25       3.8        2.7       0.91
Q50       4.3        3.0       0.94
Q75       5.0        3.5       0.95
Maximum   16.3       11.5      0.98
The R² value, as well as the error values, show a very high gap filling performance. Compared to the measurement accuracy at the sites, which ranges from 1 % to 3 % (HMP45A/D User Guide, 2006), the median MAE value is within this range. However, there are single stations with a significantly lower performance. All these sites have in common that they are either exposed single mountain sites like “Zugspitze” (2964 m a.s.l., RMSE: 16.3 %, R²: 0.53) or “Feldberg/Schwarzwald” (1490 m, 13.4 %, 0.62), or coastal sites on islands like “Arkona” (43 m, 6.2 %, 0.70) or “Helgoland” (4 m, 5.3 %, 0.71). These exposed sites are often influenced by air masses which are not represented in the measured values at the other stations.
Therefore, an additional model run with a changed configuration was performed to show the potential for improvement:
First, the data set is divided into 5-year sub data sets. This subdivision increases the number of available predictor sites, especially during the first decades with lower observation density (see Fig. 1).
Second, the temperature values of the 100 stations closest to the respective site per time step are added as predictors. Generally, correlations between temperature and relative humidity over a year are quite poor; therefore, a significant increase in performance might not be expected. However, the aforementioned structure of the gradient boosting model can also be interpreted as a kind of weather pattern classification. Hence, the additional temperature information is mostly used to decide which weather pattern is predominant at a certain time step.
If the temperature of the gap-filled station itself is included, the improvement is much higher than without it, but to obtain more realistic results, these temperature time series were excluded as predictors. The statistics of the measures are shown in Table 3.
Table 3: Measures of performance for modeling hourly values of relative humidity with temperature as an additional predictor, divided into 5-year sub data sets.
          RMSE [%]   MAE [%]   R² [–]
Minimum   2.5        1.8       0.75
Q25       3.2        2.3       0.93
Q50       3.5        2.5       0.96
Q75       4.3        3.1       0.97
Maximum   11.9       7.9       0.98
Figure 4: Histogram of R² (a) and RMSE (b) for hourly rH gap filling. Red bars show the distribution with rH as predictor, blue bars the distribution with rH and T as predictors and the subdivision into 5-year sub data sets.
Both Table 3 and Fig. 4 show considerably better results when temperature is taken into account as an additional predictor. The improvement is evident in all statistical parameters and sites. It especially affects the sites with lower performance. For instance, the median RMSE decreases from 4.3 to 3.5 %; thus, temperature as an additional predictor explains approximately 20 % of the residual variation. It is expected that adding wind speed, wind direction, air pressure or radiation information as predictors would lead to a further improvement of the gap filling performance.
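The figure of approximately 20 % appears to correspond to the relative reduction of the median RMSE between Table 2 and Table 3. That reading is an assumption; the short check below only reproduces the arithmetic under it.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error measure shrinks from `before` to `after`."""
    return (before - after) / before


# Median (Q50) RMSE of relative humidity, in %, from Table 2 and Table 3.
median_rmse_rh_only = 4.3
median_rmse_rh_and_t = 3.5

reduction = relative_reduction(median_rmse_rh_only, median_rmse_rh_and_t)
print(f"{100 * reduction:.1f} % reduction in median RMSE")
```

Under this reading the reduction is (4.3 − 3.5) / 4.3, a little under one fifth, which is consistent with the rounded value quoted in the text.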
Fig. 5 shows scatter plots for the site “Fritzlar (Flugplatz)” as an example for median skill in RMSE with (a) relative humidity as predictor and (b) relative humidity and temperature as predictors. Direct comparison and the statistical measures show the increased performance at the example site.
3.3 Wind speed
The measures of performance of the results achieved by gb are summarized in Table 4. The table shows an analysis based on the quantiles of the resulting measures. Besides the minimum and maximum values, the 25th quantile, median and 75th quantile (Q25, Q50, and Q75, respectively) are shown.
Table 4: Measures of performance for modeling hourly values of wind speed.
          RMSE [m/s]   MAE [m/s]   R² [–]
Minimum   0.44         0.31        0.47
Q25       0.70         0.52        0.79
Q50       0.82         0.61        0.84
Q75       0.96         0.72        0.88
Maximum   3.04         2.29        0.94
The data resolution at the sites ranges from 0.1 to 1.0 m/s, since the measuring devices changed over time (Deutscher Wetterdienst, 2017). The median MAE of 0.61 m/s over all sites lies within this range. Combined with the high R² values, Gradient Boosting shows a strong gap filling performance.
An example of outliers can be seen in Fig. 6a), where wind speed above 30 m/s is considered a suspicious observation. The site Hohn provides wind speed data from 1969 to 2015. Approximately 30 probably erroneous observations in this period did not prevent a high correlation (R² = 0.89). In comparison, Fig. 6b) shows the validation for the site Manschow, without suspicious observations in the time series and with an even smaller R² value.
Fig. 7 presents the spatial distribution of the gap filling performance (RMSE) at the sites for wind speed (a), relative humidity (b, 5-year steps, rH and T as predictors) and air temperature (c), respectively. The aforementioned weaker performance at higher altitudes and close to the border is visible, as is the overall high performance for all the measurands.
Figure 5: Measured and predicted values for hourly relative humidity at site “Fritzlar (Flugplatz)” (DWD ID: 1504) as an example for median skill in RMSE with a) rH as predictor (RMSE = 4.27 %, R² = 0.934) and b) rH and temperature as predictors (RMSE = 3.48 %, R² = 0.957).
Figure 6: Model results for hourly wind speed compared with the measured values at a) site “Hohn” (DWD ID: 2303, RMSE = 0.85 m/s, R² = 0.89) and b) site “Manschow” (DWD ID: 3158) as an example for median skill in RMSE (RMSE = 0.82 m/s, R² = 0.86).
4 Comparison of different gap filling approaches
Due to the excessive calculation time for nn and mlr, the data set was reduced to a test data set with a smaller number of stations (16) and a shorter time period (1990–2015) for the comparison between gradient boosting, nn and mlr. The methods gb, nn and mlr run the same tasks with the same predictors. For instance, temperature at site 1 is calculated with the temperature values at sites 2 to 16 as predictors, and temperature at site 2 is calculated with temperature at sites 1 and 3 to 16 as predictors. 50 % of the time series was used for calibration, the other 50 % for validation. The results are shown for the validation period.
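The set-up of the comparison tasks can be sketched as follows. Each site in turn becomes the target and all remaining sites serve as predictors. How the two halves of the time series were drawn is not stated in the text, so the seeded random split below is an assumption, as are the function names.

```python
import random
from typing import Dict, List, Sequence, Tuple


def leave_one_site_out(sites: Sequence[int]) -> Dict[int, List[int]]:
    """Map every target site to the list of all other sites as predictors."""
    return {target: [s for s in sites if s != target] for target in sites}


def split_half(time_steps: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split time steps 50/50 into calibration and validation.

    A random split with a fixed seed is an assumption made for this sketch.
    """
    shuffled = list(time_steps)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


tasks = leave_one_site_out(range(1, 17))          # the 16 sites of the test data set
calibration, validation = split_half(range(1000)) # 1000 toy time steps
```

Giving all three methods the same `tasks` and the same split is what makes their error measures and run times directly comparable.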
Table 5 summarizes the performance of the gradient boosting method, neural networks and multiple linear regression, respectively. Due to the reduced number of predictors, it does not show the achievable performance of the methods (see Table 1 to Table 4).
Figure 7: Spatial distribution of RMSE for hourly wind speed (a), relative humidity (b) and temperature (c) at the sites.
Table 5: Mean statistical measures for temperature, relative humidity and wind speed for gradient boosting (xgb), neural networks (nn) and multiple linear regression (mlr).
Temperature         RMSE [°C]    MAE [°C]    R² [–]   Calculation time [min]
xgb                 1.42         1.05        0.965    1.4
nn                  1.57         1.16        0.956    900
mlr                 2.09         1.54        0.926    470
Relative humidity   RMSE [%]     MAE [%]     R² [–]   Calculation time [min]
xgb                 8.8          6.4         0.71     1.2
nn                  9.3          6.9         0.67     470
mlr                 12.1         8.9         0.51     360
Wind speed          RMSE [m/s]   MAE [m/s]   R² [–]   Calculation time [min]
xgb                 1.52         1.16        0.55     0.9
nn                  1.56         1.20        0.52     360
mlr                 1.97         1.45        0.40     255
The comparison shows that xgb performs best according to all statistical measures for the meteorological parameters temperature, relative humidity and wind speed, followed by nn and mlr.
More precisely, xgb performs best for every single site for all three meteorological measurands concerning RMSE, MAE and R². Additionally, the calculation is in the worst case still 255 times faster compared to mlr.
Running xgb for the comparison task in this section took 210 seconds in total, while nn took 29 hours and mlr 18 hours on the same machine. Moreover, xgb and mlr scale more or less linearly, while the calculation time of nn increases exponentially.
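The total run times quoted above follow from summing the per-variable calculation times in Table 5. The short check below uses the rounded table values, so small deviations from the unrounded timings are to be expected; the dictionary layout and names are illustrative.

```python
# Calculation times in minutes, copied from Table 5.
CALC_TIME_MIN = {
    "xgb": {"temperature": 1.4, "relative_humidity": 1.2, "wind_speed": 0.9},
    "nn":  {"temperature": 900, "relative_humidity": 470, "wind_speed": 360},
    "mlr": {"temperature": 470, "relative_humidity": 360, "wind_speed": 255},
}


def total_minutes(method: str) -> float:
    """Total calculation time of one method over all three variables."""
    return sum(CALC_TIME_MIN[method].values())


def speedup_over(method: str, baseline: str = "xgb") -> float:
    """How many times longer `method` ran than `baseline` in total."""
    return total_minutes(method) / total_minutes(baseline)


for name in CALC_TIME_MIN:
    print(f"{name}: {total_minutes(name):.1f} min in total")
```

The totals are 3.5 min (210 s) for xgb, 1730 min (about 29 h) for nn and 1085 min (about 18 h) for mlr, so nn and mlr took roughly 490 and 310 times as long as xgb.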
5 Conclusion
Gradient boosting proved to be a very suitable tool for filling gaps in meteorological time series. There are several advantages compared to other methods like multiple linear regression or neural networks.
• Computation time: The calculations can be done on an ordinary desktop PC in 1/500 to 1/300 of the time needed for the test data set by neural networks and multiple linear regression, respectively.
• Handling of missing values in the predictor data set: There is no need to fill missing predictor data; the procedure learns how best to handle missing data in the training period.
• Performance: Gap filling performance shows very good results, which can and should be further improved by preprocessing the data, for example by excluding outliers and detrending the wind speed time series as well as calculating the wind speed at the standard height of 10 meters.
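The handling of missing predictor values named in the second item can be illustrated with a single tree node. XGBoost learns, for every split, a default direction into which samples with a missing predictor are sent (Chen and Guestrin, 2016). The class below is a simplified, hypothetical model of one such node with fixed values; it is not the library's internal data structure, and in a fitted model the direction is chosen during training.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Split:
    """One tree node with two leaves and a default direction for missing data."""
    feature: str
    threshold: float
    left_value: float
    right_value: float
    missing_goes_left: bool   # learned in training; fixed by hand in this sketch

    def predict(self, x: Dict[str, Optional[float]]) -> float:
        value = x.get(self.feature)
        if value is None or math.isnan(value):
            # Missing predictor: follow the default direction.
            return self.left_value if self.missing_goes_left else self.right_value
        return self.left_value if value < self.threshold else self.right_value


# Hypothetical node: splits on neighbor temperature T1 at 3.5 °C.
node = Split("T1", 3.5, left_value=1.5, right_value=4.0, missing_goes_left=False)
```

Because every node carries such a default, a time step at which some neighboring sites have no observation can still be passed through the whole ensemble without filling the predictors first.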
Additionally, the results can be improved by splitting the data set into decades or even smaller periods and using the 100 closest stations as predictors per time step. Moreover, using 90 % of the data instead of 50 % for the training period will improve the results. Improvements can also be achieved if other parameters, even those with low correlations such as temperature and relative humidity, are used as combined predictors. However, due to the additive calculation, the performance for extremes is expected to be relatively weak.
Finally, as pointed out, xgb outperformed the other approaches in gap filling performance and in computational time. First tests with further measurands like air pressure, precipitation and fluxes (not shown) suggest that the method is applicable to a wide range of measurands.
Acknowledgments
We want to thank Klemens Barfus for proofreading, the reviewers for many helpful hints and the German Weather Service for the public data access.
References
Behrendt, J., 1992: Dokumentation ROUTKLI: Beschreibung der Prüfkriterien im Programmsystem QUALKO. – Deutscher Wetterdienst, Offenbach a.M., 12 pp.
Dibike, Y.B., P. Coulibaly, 2006: Temporal neural networks for downscaling climate variability and extremes. – Neural Netw. 19, 135–144. DOI:10.1016/j.neunet.2006.01.003.
DIN IEC 751, 1990: Industrielle Platin-Widerstandsthermometer und Platin-Messwiderstände.
Friedman, J.H., 1999: Greedy function approximation: a gradient boosting machine. – Ann. Statistics, 1189–1232, https://statweb.stanford.edu/~jhf/ftp/trebst.pdf.
Friedman, J., T. Hastie, R. Tibshirani, 2000: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). – Ann. Statistics 28, 337–407.
Giller, G.L., 2012: A Seasonal Autoregressive Model for the Central England Temperature Series. – SSRN Scholarly Paper. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=1997813.
Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones, A. Jarvis, 2005: Very High Resolution Interpolated Climate Surfaces for Global Land Areas. – Int. J. Climatol. 25, 1965–78. DOI:10.1002/joc.1276.
Isotta, F.A., C. Frei, V. Weilguni, M. Perčec Tadić, P. Lassègues, B. Rudolf, V. Pavan, 2014: The Climate of Daily Precipitation in the Alps: Development and Analysis of a High-Resolution Grid Dataset from Pan-Alpine Rain-Gauge Data. – Int. J. Climatol. 34, 1657–75. DOI:10.1002/joc.3794.
Junninen, H., H. Niska, K. Tuppurainen, J. Ruuskanen, M. Kolehmainen, 2004: Methods for imputation of missing values in air quality data sets. – Atmos. Env. 38, 2895–2907. DOI:10.1016/j.atmosenv.2004.02.026.
Kotsiantis, S., A. Kostoulas, S. Lykoudis, A. Argiriou, K. Menagias, 2006: Filling missing temperature values in weather data banks. – In: 2nd IET International Conference on Intelligent Environments – IE 06, 1:327-34, published online. DOI:10.1049/cp:20060659.
Land, A.H., A.G. Doig, 1960: An Automatic Method of Solving Discrete Programming Problems. – Econometrica 28, 497–520. DOI:10.2307/1910129.
Lumley, T., 2017: Based on Fortran code by Alan Miller: leaps: Regression Subset Selection. R package version 3.0. – https://CRAN.R-project.org/package=leaps.
More, J.J., 1977: The Levenberg-Marquardt algorithm: Implementation and theory. – Argonne National Laboratory, Argonne, Illinois.
R Core Team, 2015: R: A Language and Environment for Statistical Computing. – R Foundation for Statistical Computing, Vienna, Austria.
Ramos-Calzado, P., J. Gómez-Camacho, F. Pérez-Bernal, M.F. Pita-López, 2008: A Novel Approach to Precipitation Series Completion in Climatological Datasets: Application to Andalusia. – Int. J. Climatol. 28, 1525–34. DOI:10.1002/joc.1657.
Reichstein, M., E. Falge, D. Baldocchi, D. Papale, M. Aubinet, P. Berbigier, C. Bernhofer, et al., 2005: On the Separation of Net Ecosystem Exchange into Assimilation and Ecosystem Respiration: Review and Improved Algorithm. – Global Change Biol. 11, 1424–39. DOI:10.1111/j.1365-2486.2005.001002.x.
Shahin, MD Abu, MD Ayub Ali, A.B.M. Shawkat Ali, 2014: Vector Autoregression (VAR) Modeling and Forecasting of Temperature, Humidity, and Cloud Coverage. – In: Computational Intelligence Techniques in Earth and Environmental Sciences, 29–51. Springer Netherlands. DOI:10.1007/978-94-017-8642-3_2.
Stein, M.L., 2012: Interpolation of Spatial Data: Some Theory for Kriging. – Springer Science & Business Media.
Xia, Y., P. Fabian, M. Winterhalter, M. Zhao, 2001: Forest climatology: estimation and use of daily climatological data for Bavaria, Germany. – Agricult. Forest Meteor. 106, 87–103. DOI:10.1016/S0168-1923(00)00210-0.