Available via license: CC BY 4.0
Content may be subject to copyright.
Region-Conditioned Orthogonal 3D U-Net
for Weather4Cast Competition
Taehyeon Kim Shinhwan Kang Hyeonjeong Shin Deukryeol Yoon
Seongha Eom Kijung Shin Se-Young Yun
KAIST AI
Seoul, Korea
{potter32, shinhwan.kang, hyeonjeong1, deukryeol.yoon}@kaist.ac.kr
{doubleb, kijungs, yunseyoung}@kaist.ac.kr
Abstract
The Weather4Cast competition (hosted by NeurIPS 2022) required competitors
to predict super-resolution rain movies in various regions of Europe when low-
resolution satellite contexts covering wider regions are given. In this paper, we
show that a general baseline 3D U-Net can be significantly improved with region-
conditioned layers as well as orthogonality regularizations on 1
×
1
×
1 convolutional
layers. Additionally, we facilitate the generalization with a bag of training strategies:
mixup data augmentation, self-distillation, and feature-wise linear modulation
(FiLM). Presented modifications outperform the baseline algorithms (3D U-Net)
by up to
19.54%
with less than 1% additional parameters, which won the 4th place
in the core test leaderboard.
1 Introduction
Precipitation forecasting is one of the most arduous problem in forecasting the meteorological
conditions such as air quality, solar, temperature, and wind velocity. Accurate forecasting can prevent
enormous economic and social damages from a variety of applications: large-scale crop management,
autonomous driving systems, and air traffic control. While Numerical Weather Prediction (NWP)
is a general method for predicting the climate changes based on the calculation of physics-based
simulations, its performance for short-term rain prediction (i.e., less than 6 hours) is still inaccurate
despite lots of computational efforts. Recently, deep learning techniques have attracted huge attention
from the weather research community for such short-term precipitation forecasting [
9
,
10
,
13
,
14
,
5
].
Specifically, among these techniques, Espeholt et al.
[5]
develop an end-to-end deep learning method
that outperforms High Resolution Rapid Refresh (HRRR) [
3
], which is the start-of-the-art method
used in United States.
Weather4Cast 2022 [
6
,
7
] is a competition for designing the best deep learning-based precipitation
forecasting model where competitors attempted to forecast super-resolution rainfall events for the next
8 hours at 15-minute intervals from low-resolution satellite radiances over various regions in Europe.
For the stage2 task in which our team participated, the desired model for competitors is to predict the
7 Europe regions across two years (2019 and 2020) when the training dataset is composed of spectral
satellite imagery which covers larger areas with low-resolutions having 11 input variables for each
pixel. While the satellite imagery data demands for the spatio-temporal modelling, the studies for
conventional methods are still under-explored for being robust towards such spatio-temporal shifts.
To tackle the challenge of spatio-temporal shifts, we propose a
R
egion
C
onditioned
N
etwork (RCN)
to inject the regional information into the output of 3D residual U-Net’s encoder architecture, which
is the variation of 3D U-Net [
4
]. With given spectral satellite contexts of different regions, RCN
can extract the region-conditioned context and such contexts linearly modulate the output of the 3D
36th Weather4cast NeurIPS 2022Competition Workshop
arXiv:2212.02059v1 [cs.CV] 5 Dec 2022
U-Net. In addition, by penalizing the orthogonality regularization for the 1
×
1
×
1 convolutional layer,
the network can capture more fined-grained representations for the super-resolution prediction so that
it yields the better score. We also stabilize the training from the spatio-temporal shifts of the dataset.
Lastly, we adapt a bag of training strategies such as mixup [
15
], self-distillation, and feature-wise
linear modulation (FiLM) [
2
,
11
]. More precisely, we add FiLM layers to a backbone model for
fine-tuning the layers for each region of each year while freezing other backbones except the FiLM
layers. We provide more details about RCN and training strategies in Section 3.
Our contributions can be summarized as follows:
•Effective:
We utilize two concepts: (1) region-conditioned network and (2) orthogonality reg-
ularization on 1
×
1
×
1 convolutional layers. With additional training strategies, our solution
outperforms a baseline up to 19.54% with less than 1% additional parameters.
•Applicable: Our approaches can be adapted to any other deep neural networks.
•Reproducible: We provide our source code at [1].
2 Overview of Weather4Cast Challenge and Provided Data
2.1 Weather4Cast Challenge
The main objective of Weather4Cast competition is to predict future super-resolution rainfall events
(i.e., rain or no-rain) from lower-resolution satellite radiance. In this competition, competitors are
required to provide a model predicting rainfall events until eight hours in 32 time slots for given 4
time slots of a proceeding hour. As the given data is composed of multiple regions in Europe across
two years, a key is to learn the robust model under spatio-temporal shifts.
The challenge comprised two different tasks: (1)
stage1
: predicting 3 different regions for 1 year
(2019) and (2)
stage2
: 7 different regions for 2 years (2019 and 2020). Additionally, the rain rate
threshold for the latter task is 0.2 while the former one is 0.0001. The solution for the stage2 task can
bring beneficial meteorological meanings while it is a harder challenge which increases the sparsity
of the rain events to be predicted. In this paper, our solution targets for the stage2 task.
2.2 Dataset
The dataset is provided with satellite imagery including 11 observed physics-information, positional
information, and observed rainfall amounts. The detailed explanations are as follows:
•Regions:
The dataset consists of 3 different regions for 1 year (2019) in stage1, and it is extended
to 7 regions for 2 years (2019 and 2020) in stage2.
•Input variables:
Each spectral satellite imagery include 11 variables which are slightly noisy
satellite radiances covering visible, water vapor, and infrared bands: IR_016, IR_039, IR_087,
IR_097, IR_108, IR_120, IR_134, VIS006, VIS008, WV_062, and WV_073. Detailed information
for each context is not provided.
•Sequential information:
Each input image covers 15 minutes where each pixel corresponds to
12km ×12km area while each pixel for the output indicates 2km ×2km area.
•Rainfall amount:
Pixel-wise rainfall information is provided as a float value. The precipitation
ratio is in Table 1.
•Static information:
Metadata contains the information of latitude, longitude, and height for each
pixel.
2.3 Evaluation Metrics
Table 1: Statistics over different regions:
boxi0015, boxi0034, and boxi0076.
Region boxi0015 boxi0034 boxi0076
No-rain 0.810 0.810 0.892
Rain 0.190 0.190 0.108
We evaluate the predictive performance in terms of
Critical Success Index (CSI) score [
12
], F1-score,
accuracy, and Intersection over Union (IoU). In par-
ticular, CSI-score is the common evaluation metric
in precipitation forecasting. It is the total number
2
Residual U-Net
Orthogonal
1x1x1 Convolution
Region-Conditioned
Context
Mixup
Self-Distillation
FiLM
Transf er Thresholding
Step 1. Train the Backbone Step 2. Fine-Tun e Step 3. Predict
Figure 1: An overview of our solution in the task of predicting the 7 regions at 2019 and 2020.
of correct event forecasts divided by the sum of the total number of storm forecasts and the number
of misses, i.e.,
CSI =TP
TP +FN +FP (1)
where TP, FN, and FP are true positive, false negative, and false positive, respectively.
3 Method
This section provides the solution of KAIST AI. Overall training consists of 3 steps: (1) train
the backbone (Residual U-Net) with orthogonal 1
×
1
×
1 convolutional layers as well as region-
conditioned, (2) fine-tune the backbone with FiLM Transfer approach [
2
] for each region of a certain
year, and (3) predict the output via thresholding(Figure 1).
3.1 Baseline: 3D U-Net
We choose the baseline model as a 3D U-Net [
4
], which utilizes the same layers with the convolu-
tional encoder-decoder architecture for the volumetric segmentation task, on the region ‘boxi0015’,
‘boxi0034’, ‘boxi0076’ in 2019 with DiceBCEloss [
16
]. As Table 2 shows, the baseline performance
can be improved more as the batch size increases. Interestingly, such baseline models make fairly
accurate predictions for different regions even without the use of region information as an input.
However, to make super-resolution predictions more accurate, conditioning for regions is needed
during the propagation.
Table 2: A preliminary survey of the 3D U-Net architecture on the validation dataset. We set the
regions to ‘boxi0015’, ‘boxi0034’, and ‘boxi0076’ in 2019.
Batch size CSI-score F1-score Accuracy IoU Loss Precision Recall
16 0.3130 0.4668 0.7302 0.3130 0.7670 0.3310 0.8321
32 0.3221 0.4767 0.7499 0.3221 0.7638 0.3457 0.8043
48 0.3303 0.4934 0.7348 0.3303 0.7629 0.3454 0.8790
3.2 Region Conditioning
To inject region information into feature maps during the propagation, we propose a new
R
egion
C
onditioned
N
etwork (RCN) to generate region-conditioned context(Figure 2). RCN is a method of
adding an auxiliary region conditioner by using two layered fully-connected networks with a ReLU
activation function. Here, we transform the region categorical variables into one-hot vector. Because
the given dataset is comprised of satellite contexts for 7 different regions in Europe, the length of
one-hot vector is 7. We extract the region-conditioned contexts including scale
γ
vector and bias
β
vector as an output of RCN with a categorical input and formulate the feature map as follows:
γ, β ←REGION CONDITIONED NETWORK(xr)
˜xr←γxr+β(2)
where
xr
is the last representation output of the encoder architecture. For the detailed computation of
the ˜xr,γis element-wisely multiplied with xrin a pointwise manner (), and βis added similarly.
3
Residual Convolutional Unit
+
+
++
+
+
+
+
+
++
+
Region
Region-Conditioned Context
Orthogonal
1x1x1 convolution
FiLM
Trans fer
Residual Tra ns -Convolutional Unit
Downsampling
Upsampling
3D Satellite
Context
Rain
Predictions
Details for FiLM layer
+
X
Scale γBias 𝜷
𝑥 "𝑥
"𝑥 = 𝛾𝑥 + 𝛽
Scale γ
Bias 𝜷
X
Figure 2: An overview of our modified U-Net architecture. Each blue box corresponds to the residual
convolutional unit and each green block denotes the residual transposed convolutional unit. During
the propagation, the region-conditioned context is added to the last output of the encoder while the
shortcut from the encoder unit to the corresponding decoder unit is transformed with orthogonal
3D 1
×
1
×
1 convolutional opertors as well as FiLM layer. The arrow denotes the propagation of a
multi-channel feature map.
3.3 Orthogonal 1×1×1 Convolution and Residual Unit
Orthogonal convolutional kernel is a class of the advanced normalization techniques to preserve
the magnitude of the propagation signal as well as to reduce the redundant features in the filter
response. Because there is the difference of resolution size between input and output, it is needed
to capture more fine-grained features from the latent representations. Inspired by the orthogonal
concept, we alleviate such issue by adding 1
×
1
×
1 convolution into the path from the encoder block
to the corresponding decoder block (Figure 2) and making those 1
×
1
×
1 convolutions soft orthogonal
with the orthogoanlity regularization, termed as Spectral Restricted Isometry Property
+
(SRIP
+
)
referring to as Kim and Yun [8], as follows:
λ
|W| X
W∈W
σ(W>W−In)(3)
where
W
is a weight matrix of 1
×
1
×
1 convolutional kernel,
W
is a set of 1
×
1
×
1 convolutional
kernel’s weight matrices, and σ(W>W−In)is the output of the power method.
u←(W>W−In)v, v ←(W>W−In)u,
σ(W>W−In)←kvk
kuk.(4)
where the vector
v∈Rn
is randomly initialized with normal distribution. The key difference
from Kim and Yun
[8]
is whether the dimension of the penalized weight is 5D convolution or 4D
convolution. Although we can not quantitatively/qualitatively confirm the visible latent feature-
map changes through orthogonality, we observe the improvement in CSI-score performance in the
validation set.
Residual Unit.
We design a 3D-Residual U-Net, which is a variant of the baseline 3D U-
Net [
4
] (Figure 2). The main difference between the baseline and ours is the block type. We
make a shortcut for each encoder and decoder block while an 1
×
1
×
1 convolutional layer is added if
there is the difference of number of channels between input and output.
4
3.4 Data Augmentation: Mixup
Since satellite imagery datasets rarely contain rainfall data, compared to the abundant non-rainfall
data, the model is easily biased towards the majority class, i.e., non-rainfall. To mitigate the bias
on the majority class and encourage elaborated classification on the minority class, we applied
Mixup [
15
], a popular data augmentation technique. Mixup regularizes neural networks by utilizing
the convex combination of training data without large computational overhead, and the effectiveness
is proved over various image classification tasks and semantic segmentation tasks [
15
]. Generally,
mixup is utilized on 4D datasets, i.e., (
xi
,
yi
)
∈RB×C×H×W
where
B
is batch size,
C
is the number
of channels,
H
is the height, and
W
is the width, while it is under-explored on 5D dataset which
time-dimension is added. We formulate the augmented training data in the same manner for the
general Mixup, and the details are as follows:
˜x=λxi+ (1 −λ)xj,
˜y=λyi+ (1 −λ)yj,
where (
xi
,
yi
)
∈RB×C×T×H×W
are training data and ground truth target, (
xj
,
yj
) are randomly
shuffled data of (
xi
,
yi
), and
λ∼
Beta(
α
,
α
). We fixed
α= 1
after exploring some values. Interest-
ingly, as seen in Table 3, the model utilizing mixup achieved the visible performance improvement in
terms of F1-score, CSI-score, and IoU.
Table 3: Comparison of baseline and model utilizing mixup on the validation dataset.
Methods CSI-score F1-score Accuracy IoU Loss Precision Recall
Baseline (best) 0.3303 0.4934 0.7348 0.3303 0.7629 0.3454 0.8790
Mixup 0.3699 0.5397 0.7340 0.3699 0.8024 0.3877 0.8926
3.5 Self-Distillation
After training the region-conditioned backbone, it is re-trained with self-distillation, without the
supervision of ground-truth labels. Since the ground-truth observation is sparse and noisy, its training
is unstable as well as over-confident. To release this concern, we apply the self-distillation loss which
is a well-known technique for smoothing the loss landscape to lead to a flatten optima.
3.6 Feature-wise Linear Modulation (FiLM) Layer for further Fine-Tuning
As the last step, the self-distilled model is fine-tuned with each regional data for each year (i.e., we
have the 7
×
2 architectures). Because the pre-trained model can be under-performed for a specific
region (or year) while it captures the general features for all regions, the pre-trained model needs
further updates for personalization. For fine-tuning, we use the FiLM Transfer (FiT) inspired by
[
2
], which fixes the pre-trained backbone and fine-tunes only FiLM adapter layers (Figure 2). We
initialize the new learnable parameters for linear modulation: scale
γf
and bias
βf
and modify the
latent representations of the shortcut path from encoder to decoder during fine-tuning:
˜xr←γfxr+βf(5)
where the detailed computation of the
˜xr
is the same with Equation
(2)
. Here, we do not use an
auxiliary network like RCN, but directly use scale and bias parameters as learnable parameters.
3.7 Thresholding
We generate the optimal precipitation output by controlling the threshold for each region across a
year. Generally, the model decides positive rain if the corresponding probability is over 0.5. However,
because all regions have different precipitation distributions, i.e., different precipitation scales, this
characteristic leads to a sub-optimal due to different scales of certainty on each region, even after
the FiLM Transfer. Thresholding approach is a ubiquitous technique to tune the prediction in a
post-processing manner [
5
]. More precisely, given
p∈[0,1]
, the points with probability higher
than
p
is decided as positive rain. We explore the best threshold with fixing bin into 0.1, i.e.,
5
p= 0.1,0.2,· · · ,0.9
. As a result, the relaxed threshold enhances the rain generalization across
several regions, and the best performance can be achieved with the combination of thresholds
p= 0.1, p = 0.2,
and
p= 0.4
for different regions. Especially, ‘boxi0076’ region, which has little
precipitation observations, is significantly improved.
3.8 Training Details and Leaderboard Results
Table 4 includes the value of hyperparameters used in this work. Through the combination of our
proposed methods, we can achieve more generalized score(Table 5).
Table 4: Hyperparameter settings.
Hyperparameter Optimizer Learning rate Maximum epoch Dropout rate Patience Batch Size
Value AdamW 1e-4 90 0.4 40 56
Table 5: The leaderboard score (i.e., IoU) of our solution for stage2 compared to the baseline score
submitted by the organizer.
Task Region boxi0015 boxi0034 boxi0076 roxi0004
Year 2019 2020 2019 2020 2019 2020 2019 2020
Core Test Baseline 0.223 0.163 0.149 0.298 0.268 0.070 0.204 0.296
Ours 0.270 0.237 0.200 0.362 0.294 0.104 0.234 0.321
Core Heldout Baseline 0.299 0.193 0.243 0.238 0.155 0.369 0.272 0.251
Ours 0.328 0.210 0.279 0.237 0.166 0.328 0.294 0.311
Task Region roxi0005 roxi0006 roxi0007 Overall
Year 2019 2020 2019 2020 2019 2020
Core Test Baseline 0.240 0.228 0.341 0.274 0.327 0.089 0.226
Ours 0.281 0.272 0.384 0.336 0.361 0.133 0.271
Core Heldout Baseline 0.274 0.299 0.325 0.403 0.245 0.005 0.255
Ours 0.287 0.318 0.389 0.417 0.248 0.028 0.274
4 Conclusion
In this work, we propose the region-conditioned orthogonal residual U-Net model for precipitation
forecasting. The performance of the proposed model outperforms the 3D U-Net model by up to
19.54%
. Our contributions can be folded into three perspectives. Firstly, we renovate the original 3D
U-Net with effective techniques, that are region-conditioned layers and orthogonality regularization
on 1
×
1
×
1 convolutional layers. Next, the generalization capability of the model can be enhanced
by utilizing common training techniques such as mixup data augmentation, self-distillation, and
feature-wise linear modulation. Lastly, our solution can be easily reproduced, and thus our repository
can offer a great entry point to facilitate future developments in rainfall forecasting for data scientists.
These interesting approaches are not limited to 3D U-Net, but could be applied to other architectures
designed for precipitation forecasting, such as ConvLSTM and MetNet.
Acknowledgement
This work was supported by the Korea Meteorological Administration Research and Development
Program "Development of AI techniques for Weather Forecasting" under Grant (KMA2021-00121)
and Institute of Information & communications Technology Planning & Evaluation (IITP) grant
funded by the Korea government(MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School
Program(KAIST)).
6
References
[1]
Our implementation code.
https://github.com/hyeonjeong1/
22-Neurips-Competition-Baseline, 2022.
[2]
Anonymous. Fit: Parameter efficient few-shot transfer learning for personalized and federated
image classification. In Submitted to The Eleventh International Conference on Learning
Representations, 2023. URL
https://openreview.net/forum?id=9aokcgBVIj1
. under
review.
[3]
Stanley G Benjamin, Stephen S Weygandt, John M Brown, Ming Hu, Curtis R Alexander,
Tatiana G Smirnova, Joseph B Olson, Eric P James, David C Dowell, Georg A Grell, et al.
A north american hourly assimilation and model forecast cycle: The rapid refresh. Monthly
Weather Review, 144(4):1669–1694, 2016.
[4]
Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger.
3d u-net: learning dense volumetric segmentation from sparse annotation. In International
conference on medical image computing and computer-assisted intervention, pages 424–432.
Springer, 2016.
[5]
Lasse Espeholt, Shreya Agrawal, Casper Sønderby, Manoj Kumar, Jonathan Heek, Carla
Bromberg, Cenk Gazen, Rob Carver, Marcin Andrychowicz, Jason Hickey, et al. Deep learning
for twelve hour precipitation forecasts. Nature communications, 13(1):1–10, 2022.
[6]
Aleksandra Gruca, Pedro Herruzo, Pilar Rípodas, Andrzej Kucik, Christian Briese, Michael K.
Kopp, Sepp Hochreiter, Pedram Ghamisi, and David P. Kreil. CDCEO’21 - First Workshop on
Complex Data Challenges in Earth Observation, page 4878–4879. Association for Computing
Machinery, New York, NY, USA, 2021. ISBN 9781450384469. URL
https://doi.org/10.
1145/3459637.3482044.
[7]
Pedro Herruzo, Aleksandra Gruca, Llorenç Lliso, Xavier Calbet, Pilar Rípodas, Sepp Hochre-
iter, Michael Kopp, and David P. Kreil. High-resolution multi-channel weather forecast-
ing – first insights on transfer learning from the weather4cast competitions 2021. In 2021
IEEE International Conference on Big Data (Big Data), pages 5750–5757, 2021. doi:
10.1109/BigData52589.2021.9672063.
[8]
Taehyeon Kim and Se-Young Yun. Revisiting orthogonality regularization: A study for con-
volutional neural networks in image classification. IEEE Access, 10:69741–69749, 2022. doi:
10.1109/ACCESS.2022.3185621.
[9]
Jihoon Ko, Kyuhan Lee, Hyunjin Hwang, Seok-Geun Oh, Seok-Woo Son, and Kijung Shin.
Effective training strategies for deep-learning-based precipitation nowcasting and estimation.
Computers & Geosciences, 161:105072, 2022.
[10]
Jihoon Ko, Kyuhan Lee, Hyunjin Hwang, and Kijung Shin. Deep-learning-based precipitation
nowcasting with ground weather station data and radar data. arXiv preprint arXiv:2210.12853,
2022.
[11]
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film:
Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 32, 2018.
[12]
Joseph T Schaefer. The critical success index as an indicator of warning skill. Weather and
forecasting, 5(4):570–575, 1990.
[13]
Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun
Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting.
Advances in neural information processing systems, 28, 2015.
[14]
Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and
Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model.
Advances in neural information processing systems, 30, 2017.
7
[15]
Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[16]
Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang.
Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in
medical image analysis and multimodal learning for clinical decision support, pages 3–11.
Springer, 2018.
8