KL Regularized Normalization Framework for Low Resource Tasks
Neeraj Kumar
IIT Delhi
neerajkr2k14@gmail.com
Ankur Narang
IEEE Senior Member
annarang2@gmail.com
Brejesh Lall
IIT Delhi
brejesh@ee.iitd.ac.in
Abstract
Large pre-trained models, such as BERT, GPT, and
Wav2Vec, have demonstrated great potential for learning
representations that are transferable to a wide variety of
downstream tasks. It is difficult to obtain a large quantity
of supervised data due to the limited availability of resources
and time. In light of this, a significant amount of research
has been conducted on adapting large pre-trained models
to diverse downstream tasks via fine tuning, linear
probing, or prompt tuning in low resource settings.
Normalization techniques are essential for accelerating
training and improving the generalization of deep neural
networks and have been successfully used in a wide variety
of applications. Many normalization techniques have
been proposed, but the success of normalization in low
resource downstream NLP and speech tasks is limited. One of
the reasons is the inability to capture expressiveness through the
rescaling parameters of normalization. We propose Kullback-
Leibler (KL) Regularized normalization (KL-Norm), which
makes the normalized data well behaved and helps in better
generalization: it reduces over-fitting, generalizes well on
out-of-domain distributions and removes irrelevant biases
and features, with a negligible increase in model parameters
and memory overhead. Detailed experimental evaluation on
multiple low resource NLP and speech tasks demonstrates
the superior performance of KL-Norm as compared to other
popular normalization and regularization techniques.
1. Introduction
Due to breakthroughs in optimization techniques, big
datasets, and streamlined designs of deep neural architectures,
deep learning (DL) has gained significant success in a
variety of domains. Nevertheless, deep learning is renowned
for requiring huge labelled datasets, which restricts the
scalability of a deep model due to the expense of annotation.
Early research in this subject utilised data augmentation and
regularisation approaches to mitigate the overfitting issue
caused by a lack of data, but only to a limited degree.
With the arrival of large pre-trained models
such as BERT [8], GPT [44], Wav2Vec [3], Whisper [39], etc.,
low resource tasks have become an active area of research
with many industry use-cases such as manufacturing,
gaming, the metaverse, etc. A lot of techniques, such as fine
tuning, linear probing [25] and prompt tuning [58], have been
explored to improve the performance of these models in low
resource settings. However, no work has been done in the direction of
normalization to make these models work better for low resource tasks.
The normalization method is one of the fundamental con-
tributions to the deep learning community. It is a method
of adaptive reparametrization, motivated by the difficulty of
training very deep neural networks [15]. Batch Normaliza-
tion [21] was the first normalization technique, proposed to pre-
vent training from getting stuck in the saturated regimes of
non-linearities and to improve training speed. It provides a regu-
larization effect which leads to better generalization of deep
neural networks. It also permits larger learning rates and allows
reducing photometric distortions, since batch normalized networks
train faster and observe each training example fewer times.
Beyond Batch Normalization, several normalization
techniques have been proposed which work better on various
applications such as stylization [20, 37,50] , recurrent neural
networks [2, 7], object detection [54] and faster conver-
gence [47], image to image translation [41], neural acoustic
modeling [26] and vision related applications [28, 33]. Such
techniques have gained success in all the fields of artificial
intelligence such as NLP, vision, speech and others.
In the traditional batch normalization operation, normal-
ization is done by mean-shifting the feature map and scaling
it to unit variance. To increase the expressivity of the
network, we use two learnable parameters, namely a scale $\gamma$
and a bias $\beta$, as shown in Equation (1).

$$z = \gamma\,\frac{x - \mu}{\Sigma} + \beta \qquad (1)$$

Assuming the input data is Gaussian, after passing through
batch normalization it becomes unit Gaussian; to increase
the representative power, we add the learnable parameters
to shift its mean and scale its variance according to the deep
learning task.
Now if we expand Equation (1) and combine the scale
and shift with the mean and variance, we get the general
equation of normalization (Equation (3)).

$$z = \frac{\gamma}{\Sigma}\,x + \Big(\beta - \frac{\gamma\mu}{\Sigma}\Big) \qquad (2)$$

$$z = \alpha x + \zeta \qquad (3)$$
Normalization makes the data well behaved by making
it unit Gaussian, which helps in faster optimization, larger
learning rates and implicit regularization. Equation (3) is
the general formulation of normalization, where $\alpha$ and $\zeta$
incorporate the mean and variance; this particular choice
recovers batch normalization as in Equation (1). There are various
ways to make the normalization framework well behaved and
effective for deep learning training with different formulations
of $\alpha$ and $\zeta$.
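As a quick illustration (our own sketch, not code from the paper), the following NumPy snippet checks numerically that the batch normalized output of Equation (1) coincides with the general affine form $z = \alpha x + \zeta$ of Equation (3) when $\alpha = \gamma/\Sigma$ and $\zeta = \beta - \gamma\mu/\Sigma$; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1024)   # a mini-batch of activations
gamma, beta = 1.5, -0.3                         # learnable scale and bias

mu = x.mean()
sigma = x.std()                                 # plays the role of Sigma in Eq. (1)

# Batch normalization, Eq. (1): normalize, then rescale and shift.
z_bn = gamma * (x - mu) / sigma + beta

# General affine form, Eq. (3): z = alpha * x + zeta with the substitutions of Eq. (2).
alpha = gamma / sigma
zeta = beta - gamma * mu / sigma
z_affine = alpha * x + zeta

assert np.allclose(z_bn, z_affine)              # the two formulations agree
print(z_bn.mean(), z_bn.std())                  # approximately beta and gamma
```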
We are interested in a normalization framework that
works well in the low resource setting, as this has a lot of industry
and research relevance. High resource supervised datasets are
time and resource consuming to build and sometimes impossible to
scale in an industry setup. With the arrival of large pre-trained
models such as BERT, Wav2Vec, etc., it is possible to achieve
good results in low resource settings.
Current normalization approaches such as batch normal-
ization and layer normalization have not been effective in the low
resource setup at making the data well behaved, increasing the
expressive power of the network and improving generalization,
as shown in the experimental section. No prior work has addressed
making the data well behaved through normalization in the
low resource setting.
In this regard, we propose KL-Norm, a KL Regularized
normalization framework which imposes a Gaussian prior on the
normalization framework through a KL divergence loss.
This shows promising results
in low resource tasks. KL Regularized normalization helps
improve accuracy, as the KL regularization acts as
an ensemble over the data. It helps training generalize better,
as it reduces overfitting by adding a regularization loss
to the training objective. It generalizes well on out of do-
main datasets as compared to other normalization techniques.
It filters relevant features and removes the superficial features
or biases present in the dataset or the pre-trained model. An
overview of the proposed normalization mechanism is shown
in Figure 1. We specifically make the following contributions:
• We propose a novel KL Regularized normalization
framework (KL-Norm) which incorporates rescaling
parameter computation by considering a regularization
loss (Section 3).
• KL-Norm demonstrates better expressiveness due to
the regularization loss and generalizes well by reducing
over-fitting. It incorporates uncertainty, which promotes
better out of domain data generalization.
• Detailed experimental analysis demonstrates superior
accuracy and performance of KL-Norm as compared to
other normalization techniques on low resource down-
stream NLP tasks including sentiment classification,
characterizing semantic relationships, semantic textual
similarity, textual entailment and paraphrase detection,
as well as downstream speech tasks such as keyword
detection and emotion classification.
2. Related Work
We have divided this section into two parts: low
resource NLP and speech, and normalization techniques.
Low resource NLP and Speech
Earlier works in low resource settings include feature engineering,
which requires significant effort when adapting to new datasets [49].
Other work transfers knowledge across domains to increase the
data available for training [59]. Adversarial training [16] is one
such approach: it uses knowledge from a domain where
plentiful data is present and performs out of domain adaptation
on low resource data [12]. However, these approaches have
not used a pre-trained generic language model, but perform
pre-training for each task individually. Another line of low
resource training involves using language models [6, 22]. [19]
showed that a classifier that fine-tunes a pre-trained BERT
model generally has wider optima on the training loss curves
in comparison to models trained for the same task from
scratch, indicating a more general classifier. In the speech
domain, pre-trained architectures such as DeepSpeech [18]
and Wav2Vec [3] have been explored for various classification
tasks such as intent [55], phoneme [3], and speaker and language iden-
tification [11]. Another approach, linear probing, adds a linear layer on
top of a large pre-trained model and fine-tunes only that linear layer.
Prompt tuning has gained a lot of attention since the arrival
of GPT-3, where discrete or continuous prompts
are used so that the large pre-trained model predicts the task.
However, none of these works studied the impact of normalization
based on KL regularization in the low resource setting.
Normalization techniques
Batch normalization (BN) is the first of the normalization techniques,
proposed in [21] to normalize feature maps by computing
batch statistics, which helps train deep neural
networks faster. Normalization works well because first
order optimization algorithms such as SGD work better on an
isotropic landscape [38].
Motivated by this, a lot of normalization techniques have
been proposed to deal with different scenarios. Layer Normal-
ization (LN) [2] and Recurrent Batch Normalization [7]
give better performance in recurrent deep learning models.
Instance Normalization (IN) [50] and Adaptive Instance
Normalization [20] help in image stylization, Group
Normalization [54] improves performance in object detection,
and Weight Normalization [47] speeds up convergence by a
reparametrization of the weight vectors in a neural network
that decouples the length of those weight vectors from their
direction. Batch-Instance Normalization (BIN) [37] controls
styles adaptively to the task and selectively for individual
feature maps. SPADE [41] makes the denormalization
spatially sensitive; SPADE normalization boils down to
"conditional batch normalization which varies on a per-pixel
basis". Switchable Normalization [33] dynamically selects among
BN, LN and IN in proportion and works better on vision tasks.
Stochastic Normalization [28] gives regularization effects
and performs better on vision tasks. [23] provides a meta learning
mechanism for instance-level normalization techniques. [26]
generates the rescaling parameters from different speakers
and environments for adaptive neural acoustic modeling
via Layer Normalization. The proposed KL-Norm uses KL
Regularized inference based affine parameters to improve
expressive power on low resource downstream NLP tasks.
3. Theoretical Foundation of KL Regularized
Normalization
3.1. Preliminaries : Batch Normalization
Batch normalization (BN) [21] was first introduced for faster
convergence and training stability. In traditional deep net-
works, a too-high learning rate may result in gradients that
explode or vanish, as well as getting stuck in poor local min-
ima. BN helps address such issues. By normalizing activa-
tions throughout the network, it prevents small changes to
the parameters from amplifying into larger and suboptimal
changes in activations and gradients; for instance, it prevents the
training from getting stuck in the saturated regimes of nonlin-
earities. It leads to better generalization of the network because it
has an implicit regularization effect, and then the neural
network sometimes does not require explicit regularization techniques
such as dropout [48], mixout [32], or weight decay [29].
Equation (5) gives the batch normalized output $z$, where the
input $x_1,\ldots,x_m$ is used to calculate the mean $\mu$ and
variance $\sigma^2$ as in Equation (4). The scale $\gamma$ and bias $\beta$ in Equation (5)
give the flexibility to adjust the normalized input $\hat{x}$ if there is
a need, thus increasing the representation power.

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu)^2 \qquad (4)$$

$$z = \gamma\,\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \qquad (5)$$
At the inference stage, the mini-batch estimates $\mu$ and $\sigma^2$
are not available, so BN tracks a moving average of
the statistics during training (Equation (6)), where $\alpha$ is the
coefficient of the moving average, and $\hat{\mu}$ and $\hat{\sigma}^2$ are the moving average
versions of $\mu$ and $\sigma^2$. These moving statistics $\hat{\mu}$ and $\hat{\sigma}^2$ at
iteration $t$ are used to normalize the feature map during inference
as given in Equation (7).

$$\hat{\mu}^{(t)} = \alpha\,\mu^{(t)} + (1-\alpha)\,\hat{\mu}^{(t-1)} \qquad \hat{\sigma}^{2\,(t)} = \alpha\,\sigma^{2\,(t)} + (1-\alpha)\,\hat{\sigma}^{2\,(t-1)} \qquad (6)$$

$$\hat{x} = \frac{x - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} \qquad (7)$$
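For concreteness, a minimal NumPy sketch of Equations (4)-(7) follows (our own illustration, not code from the paper): it normalizes a mini-batch with batch statistics during training and with the tracked moving averages at inference.

```python
import numpy as np

class SimpleBatchNorm:
    """Per-feature batch normalization following Eqs. (4)-(7), without the affine part."""
    def __init__(self, num_features, alpha=0.1, eps=1e-5):
        self.alpha, self.eps = alpha, eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training=True):
        if training:
            mu = x.mean(axis=0)                      # Eq. (4): mini-batch mean
            var = x.var(axis=0)                      # Eq. (4): mini-batch variance
            # Eq. (6): update moving statistics.
            self.running_mean = self.alpha * mu + (1 - self.alpha) * self.running_mean
            self.running_var = self.alpha * var + (1 - self.alpha) * self.running_var
            return (x - mu) / np.sqrt(var + self.eps)            # Eq. (5) without gamma, beta
        # Eq. (7): at inference, normalize with the moving statistics.
        return (x - self.running_mean) / np.sqrt(self.running_var + self.eps)

bn = SimpleBatchNorm(num_features=4)
batch = np.random.default_rng(1).normal(size=(8, 4))
print(bn(batch, training=True).std(axis=0))   # close to 1 per feature
```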
In the batch normalization setting, the normalized feature
map $\hat{x}$ follows an isotropic Gaussian distribution [15] with
zero mean and unit variance, i.e. $\hat{x} \sim \mathcal{N}(0, I)$. The reparam-
eterization of the normalized feature map through the rescaling
parameters, i.e. mean and variance (Equation (11)), allows the
output to represent the same family of functions of the input
as the old parametrization, but the new parametrization has
different learning dynamics, as shown in Equation (5). This
setting has not been found useful in the low resource setting: it is
unable to capture expressiveness through normalization and
does not perform well in out of domain generalization.
3.2. KL Regularized Batch Normalization
We denote the normalized feature maps by $z_i$, $i = 1,\ldots,K$. To
make the data well behaved and capture more representative
power through normalization, we impose a prior $t_i$ which
follows a Gaussian distribution. Assume the Cross-Entropy
(CE) loss with respect to the neural network parameters $\theta$
is denoted by $L(\theta)$. To impose the prior on the normalized feature
map, we pose the optimization as a constrained problem:

$$\min_{\theta} L(\theta) \quad \text{s.t.} \quad \mathbb{E}[d_z(z_i, t_i)] \le \epsilon_i, \quad i = 1,\ldots,K \qquad (8)$$

This is equivalent to

$$\min_{\theta} L(\theta) + \lambda \sum_{i=1}^{K} d_z(z_i, t_i) \qquad (9)$$
We apply gradient-based optimization to this loss, where
$L(\theta)$ is the CE loss. In particular, we choose $\mathbb{E}[d_z(z_i, t_i)]$
to be the KL loss between the normalized feature map and the
Gaussian prior. This brings the distribution of the normalized
output closer to the Gaussian prior and adds a regularization effect
to the network. Here $\lambda$ is a hyperparameter which is tuned
according to the task.
In the proposed normalization setting, the affine param-
eters $\gamma$ and $\beta$ can be seen as the rescaling parameters,
i.e. the mean $\mu_v$ and standard deviation $\Sigma_v^{1/2}$, of the normalization
framework. These rescaling parameters (mean and variance)
can be modeled with a deep neural network; we
use a multi layer perceptron (MLP) to model the mean and
variance. With this, the KL Regularized normalized output
$z$ is defined by Equation (11).

$$\mu_v = MLP(x) \qquad \Sigma_v^{\frac{1}{2}} = MLP(x) \qquad (10)$$

$$z = \Sigma_v^{\frac{1}{2}}\,\hat{x} + \mu_v \qquad (11)$$
We impose the Gaussian prior on the normalized
output through the KL divergence loss given
in Equation (12), where $\mu_0$ and $\mu_1$ are $K$ dimensional mean
vectors, and $\Sigma_0$ and $\Sigma_1$ are diagonal covariance matrices.

$$KL\big(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \frac{1}{2}\Big(\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1-\mu_0)^{T}\Sigma_1^{-1}(\mu_1-\mu_0) - K + \log\frac{\det(\Sigma_1)}{\det(\Sigma_0)}\Big) \qquad (12)$$
The KL loss has several properties which are useful for better
expressivity and generalization in the low resource setting. We
discuss each of these properties in the next subsections.
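As an illustration of Equation (12), the following sketch (our own, not from the paper) computes the KL divergence between a diagonal Gaussian $\mathcal{N}(\mu_v, \Sigma_v)$ produced by the normalization layer and a standard Gaussian prior $\mathcal{N}(0, I)$, which is the form typically added to the training loss.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, var, eps=1e-8):
    """KL(N(mu, diag(var)) || N(0, I)) per Eq. (12) with mu_1 = 0 and Sigma_1 = I."""
    var = np.maximum(var, eps)                  # numerical safety
    k = mu.shape[-1]
    return 0.5 * (var.sum(-1) + (mu ** 2).sum(-1) - k - np.log(var).sum(-1))

# Example: a batch of predicted rescaling parameters (hypothetical values).
mu_v = np.array([[0.1, -0.2, 0.05], [0.0, 0.3, -0.1]])
var_v = np.array([[0.9, 1.1, 1.0], [1.2, 0.8, 1.0]])
kl_per_example = kl_diag_gaussian_to_standard(mu_v, var_v)
print(kl_per_example.mean())    # averaged KL term added to the CE loss with weight beta
```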
Regularization effect of the Kullback-Leibler loss [1]
The KL loss acts as a regularizer that keeps the
representations $z$ sufficiently diverse. If we do not include
the regularizer, the model can learn to cheat and give each
datapoint a representation in a different region of Euclidean
space. With the KL loss, a Gaussian prior is imposed
on the normalization framework, which makes the data
points follow a Gaussian distribution with similar
representations close together.
Ensemble effect
KL Regularized normalization incorpo-
rates uncertainty in two ways [31]. It is doubly stochastic in
the sense that both the underlying feature representation $z$
and the labels $y$ are regarded as random variables, whereas
standard deep neural networks only regard the labels as
random variables. KL Regularized normalization has the abil-
ity to model both the mean and the variance of the label predictions
by explicitly modeling the representation distribution.
In most deep neural networks, the output layer corresponds
to a distribution in which the variance is a function of the mean
(e.g., a binary classifier predicting $p$ for a class occurrence
must also predict the variance $p(1-p)$). The stochasticity in
the representation induces an effective ensemble of decoder
predictions [13, 40]. This ensemble effect of KL Regularized
normalization helps the model achieve higher accuracy
and reduces over-fitting [31].
Out of domain generalization
The second source of
uncertainty is provided by the KL divergence between the
conditional distribution over the latent space given the input and
the latent space defined by the learned marginal $p_\phi(z)$, i.e.
$KL[q(z|x)\,\|\,p(z)]$. Here, the marginal effectively learns a
density model for the data, albeit in the lower-dimensional,
lower-information latent space rather than the original
input space. Density estimation, whether explicit [9] or im-
plicit [27], has been shown to be useful for out-of-distribution
detection.
Algorithm 1 KL Regularized Normalization
Input: mini-batch feature maps of each channel $x = \{x_i\}_{i=1}^{m}$; moving statistics $\hat{\mu}$, $\hat{\sigma}^2$; moving statistics update rate $\alpha \in (0, 1)$
Output: $z$ = KL-Norm($x$)
1: Training:
   $\mu \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$, $\sigma^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu)^2$ // mini-batch mean and variance
   $\hat{x} \leftarrow \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$ // normalize with mini-batch statistics
   $\mu_v = MLP(x)$, $\Sigma_v^{1/2} = MLP(x)$ // rescaling parameters, i.e. mean and variance
   $z = \Sigma_v^{1/2}\,\hat{x} + \mu_v$ // KL Regularized normalized output feature map
   $\hat{\mu} \leftarrow \alpha\mu + (1-\alpha)\hat{\mu}$, $\hat{\sigma}^2 \leftarrow \alpha\sigma^2 + (1-\alpha)\hat{\sigma}^2$ // update estimates of moving statistics
2: Loss:
   Loss = CE + $\beta\,$KL
3: Inference:
   $\mu_v = MLP(x)$, $\Sigma_v^{1/2} = MLP(x)$ // rescaling parameters, i.e. mean and variance
   $z \leftarrow \Sigma_v^{1/2}\,\frac{x - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \mu_v$ // KL Regularized normalization output feature map
Algorithm
Figure 1 shows the proposed KL-Norm
framework, where two multi-layer perceptron (MLP) layers
are used to compute the rescaling parameters, i.e. the mean $\mu_v$ and
standard deviation $\Sigma_v^{1/2}$. The normalized feature map $\hat{x}$ is calculated
using the mini-batch statistics and then affine-transformed with the
rescaling parameters.
The procedure is summarized in Algorithm 1. At the training stage, we
compute the mini-batch statistics $\mu$ and $\sigma^2$ to obtain the
normalized feature map $\hat{x}$. The rescaling parameters, i.e.
the mean $\mu_v$ and variance $\Sigma_v^{1/2}$, are calculated and used
in a linear transformation to get the final output feature
map $z$. The moving average statistics are computed during
training to be used at inference. The cross entropy (CE) and
Kullback-Leibler (KL) loss functions are used during training,
with $\beta$ as a hyper-parameter. At inference, the moving
average statistics are used to compute the normalized feature map,
and the final output is generated by a linear transformation of the
normalized feature map using the rescaling parameters.
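A minimal PyTorch sketch of Algorithm 1 follows. It is our own illustration of the described procedure; the layer sizes, the use of a softplus to keep the predicted standard deviation positive, and the standard-normal prior are assumptions on our part, not code released with the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KLNorm(nn.Module):
    """Sketch of KL Regularized Normalization (Algorithm 1)."""
    def __init__(self, num_features, alpha=0.1, eps=1e-5):
        super().__init__()
        self.alpha, self.eps = alpha, eps
        # MLPs predicting the rescaling parameters mu_v and Sigma_v^{1/2} from the input.
        self.mu_mlp = nn.Sequential(nn.Linear(num_features, num_features), nn.ReLU(),
                                    nn.Linear(num_features, num_features))
        self.std_mlp = nn.Sequential(nn.Linear(num_features, num_features), nn.ReLU(),
                                     nn.Linear(num_features, num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mu, var = x.mean(dim=0), x.var(dim=0, unbiased=False)   # mini-batch statistics
            with torch.no_grad():                                    # Eq. (6): moving statistics
                self.running_mean = self.alpha * mu + (1 - self.alpha) * self.running_mean
                self.running_var = self.alpha * var + (1 - self.alpha) * self.running_var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / torch.sqrt(var + self.eps)                # normalize
        mu_v = self.mu_mlp(x)                                        # rescaling mean
        std_v = F.softplus(self.std_mlp(x))                          # rescaling std, kept positive
        z = std_v * x_hat + mu_v                                     # Eq. (11)
        # KL(N(mu_v, std_v^2) || N(0, I)) per dimension, averaged over the batch (Eq. (12)).
        kl = 0.5 * (std_v.pow(2) + mu_v.pow(2) - 1.0
                    - 2.0 * std_v.clamp_min(1e-8).log()).sum(-1).mean()
        return z, kl

# Usage sketch: in a classifier, the KL term is added to the cross-entropy loss.
layer = KLNorm(num_features=768)
features = torch.randn(8, 768)                  # e.g. pooled sentence embeddings
z, kl = layer(features)
print(z.shape, kl.item())                       # total loss would be CE + beta * kl
```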
3.3. Model Architecture
Figure 1 shows the architecture, comprising the pre-trained lan-
guage model, the MLP module, the KL Regularized normalization
and the classifier. The pre-trained BERT Base (12 layers, 110M
parameters) and BERT Large (24 layers, 340M parameters)
uncased [8] implementations of [53] are used as our base
models, which give 768 dimensional embeddings for the
sentence representation. The MLP module used to compute
the compressed sentence representations is a shallow
MLP with $768$, $\frac{2304+K}{4}$ and $\frac{768+K}{2}$ hidden units and a ReLU
non-linearity, where $K \in \{384, 512\}$, giving a $K$ dimensional
embedding. The MLP module acts as a bottleneck
for incorporating the relevant features into the KL Regularized
normalization through the rescaling parameters. The KL
Regularized normalization module includes two linear layers
which are used to calculate the affine/rescaling parameters
that go into the normalization. A linear layer classifier is used
on top to classify the sentence. Similar to [5], we use a linear
annealing schedule for $\beta$ and set it to $\min(1, \text{epoch} \times \beta_0)$ in
each epoch, where $\beta_0$ is the initial value.
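For illustration, a small sketch of this β annealing schedule (our own; the function and variable names are hypothetical):

```python
def annealed_beta(epoch, beta0):
    """Linear KL-weight annealing: beta grows with the epoch index and saturates at 1."""
    return min(1.0, epoch * beta0)

# Example with an assumed initial value beta0 = 0.1: the KL weight reaches 1 at epoch 10.
print([annealed_beta(e, 0.1) for e in range(1, 12)])
```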
For the speech related downstream tasks, we use the
encoder of the Wav2Vec 2.0 pre-trained model as the base model
and add a 1d convolution layer with 768 output channels. We
add an MLP module with linear layers to compute the
affine parameters of KL-Norm. A linear layer classifier is
used on top to classify the speech related classes.
3.4. Comparison methods
We compare against the common normalization techniques used
in NLP tasks, namely batch normalization (BN) [21], layer
normalization (LN) [2] and group normalization (GN) [54].
Such techniques enable faster convergence and generalize well on
NLP datasets. Apart from that, other common regularization
techniques have also been used for the comparison, namely
Dropout [48] and Weight Decay [29]. While experimenting,
we replace the KL Regularized normalization with the
comparison methods (Figure 1). We have also performed
experiments with the BERT Base model of Figure 1 without
KL Regularized normalization.
3.5. Training Details
We use the pre-trained BERT Base and BERT Large
uncased models as base models to assess the effectiveness of
the proposed method on downstream NLP tasks. We use
the default hyper-parameters of BERT, i.e., a sequence
length of 128 with batch size 8. We use the stable variant
of the Adam optimizer [36, 57] with the default learning rate
of 2e-5 throughout all experiments. We do not use warm-up
or weight decay.
We use Wav2Vec 2.0 [3] pre-trained on
LibriSpeech 960 hours as the base model to evaluate the
performance of the proposed method on downstream speech
related tasks. We chose the Adam optimizer, set the
learning rate of the backbone to 1e-5 for fine tuning and
used an L2 penalty of 1e-5.
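A hedged sketch of the BERT fine-tuning setup described above, using the Hugging Face Transformers and PyTorch APIs (our own illustration; the authors' exact training code is not available here):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["a low resource example sentence"], padding="max_length",
                  truncation=True, max_length=128, return_tensors="pt")
outputs = model(**batch)                      # 768-dimensional token embeddings

# BERT fine-tuning: Adam with the default 2e-5 learning rate, no warm-up or weight decay.
# For the Wav2Vec 2.0 backbone the paper instead uses lr=1e-5 with an L2 penalty of 1e-5.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```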
3.6. Analysis of Increased Expressive Power
GLUE Benchmarks
We use three low resource GLUE datasets, namely MRPC, STS-B
and RTE, for the evaluation of
the proposed method. Table 1 shows the comparison of
the proposed model with the compared models using
BERT-base uncased as the base model. It shows that the
proposed normalization model outperforms the other methods
on various evaluation metrics. Our KL-Norm method
substantially improves the results and surpasses the prior
work in most settings for both BERT Base and BERT Large models.
Due to the computational overhead of BERT Large, for the rest
of this work we stick to BERT Base.
Low resource varying datasets
We use four
large NLP datasets, SNLI, MNLI, QNLI, and YELP,
and subsample each dataset using random seeds. We then
evaluate performance under varying
sizes of training data (200, 400, 600, 800 and 1000 samples),
reporting the average and standard deviation across 5
different seeds. Table 2 shows that KL-Norm consistently
outperforms all the baselines in low-resource scenarios.
We also perform experiments on three large
speech datasets: Google Speech Commands,
Crema-D and ESD, evaluating performance under
varying training sizes using random seeds. Table 3 shows that
KL-Norm performs better as compared to the other methods.
3.7. Analysis of Out of domain Generalization
We use various NLP datasets to evaluate the effectiveness
of KL Regularized normalization for out of domain
generalization. We use the datasets referred to in [34],
including SICK [35], ADD1 [42], JOCI [56], MPE [30],
MNLI, SNLI, SciTail [24], three datasets from [52],
namely DPR [45], FN+ [43] and SPR [46], and Quora Question
Pairs (QQP) interpreted as an NLI task as in [14]. We use the
same split as [51] for the experiment. We train
the model on 6000 samples of the SNLI and MNLI datasets and
use these datasets as test sets to evaluate the out of domain
generalization of the proposed model. The SNLI and MNLI
datasets contain the three labels contradiction, neutral, and
entailment. However, some of the considered target datasets
Figure 1. Architecture of proposed KL Regularized normalization framework
Table 1. Average results and standard deviation in parentheses over 5 runs on low-resource data in GLUE.
∆ shows the absolute difference
between the results of the KL-Norm model and BERT-base.
MRPC STS-B RTE
Model Accuracy F1 Pearson Spearman Accuracy
BERTBase 84.31 (0.2) 87.01 (0.2) 84.43 (0.2) 83.28 (0.1) 64.23 (1.8)
+BN [21] 86.76 (0.5) 90.49 (0.5) 82.28 (0.5) 81.57(0.7) 66.42 (1.4)
+LN [2] 85.23 (0.3) 88.12 (0.7) 84.33 (0.9) 82.98(1.0) 65.17 (0.9)
+GN [54] 85.01 (0.2) 87.98 (0.5) 82.76 (0.8) 81.91(1.1) 65.55 (0.2)
+Dropout [48] 85.55 (0.6) 88.47 (0.2) 84.11 (0.8) 82.65 (0.7) 65.12 (0.9)
+WD [32] 85.01(0.2) 86.91(0.2) 84.02(0.8) 82.29(0.5) 65.02(0.8)
KL-NormBase 87.25 (0.1) 91.03 (0.6) 82.49 (0.9) 81.63 (0.8) 70.42 (0.8)
+2.94 +4.02 -1.94 -1.65 +6.19
BERTLarge 86.76 (0.7) 90.20 (1.3) 86.27 (0.3) 85.21 (0.1) 67.11 (0.8)
+BN [21] 85.53 (0.5) 85.44 (0.5) 84.47 (0.3) 82.73(0.7) 66.77 (1.4)
+LN [2] 86.27 (0.5) 86.41 (0.5) 86.52 (0.2) 85.73(0.3) 67.23 (1.4)
+GN [54] 85.72 (0.3) 85.87 (0.2) 85.10 (0.3) 85.21(0.7) 66.2 (0.9)
+Dropout [48] 86.51 (0.4) 89.98 (0.2) 86.13 (0.8) 85.02 (0.5) 67.01 (0.6)
+WD [32] 86.11(0.5) 89.03(0.4) 86.18(0.2) 85.10(0.1) 68.07(0.9)
KL-NormLarge 87.99 (0.4) 91.38 (0.6) 86.01 (0.7) 84.68 (0.9) 71.01 (0.8)
+1.23 +1.18 -0.26 -0.53 +3.9
have only two labels, such as DPR or SciTail. When the
target dataset has two labels of entailed and not-entailed, as
in DPR, we consider the predicted contradiction and neutral
labels as the not-entailed label. In the case the target dataset
has two labels of entailment and neutral, as in SciTail, we
consider the predicted contradiction label as neutral.
Table 4 shows that KL-Norm gives an average accuracy improve-
ment of 2.62% and 4.5% over the BERT Base model when trained
on SNLI and MNLI respectively.
It shows substantial improvement against all other
baseline models. These results support our claim that
KL-Norm motivates learning more general features, rather
than redundant superficial features, leading to improved
generalization to datasets without these superficial biases.
3.8. Impact of KL-Norm on Overfitting
Log-likelihood and KL-divergence are typically balanced
by a suitable β parameter, since they have somewhat
Table 2. Test accuracies in the low-resource setting on text classification and NLI datasets under varying sizes of training data (200, 400,
600, 800 and 1000 samples). ∆ shows the absolute difference between the results of the KL-Norm model and BERT-base.
Data Model 200 400 600 800 1000
SNLI
BERTBase 60.07 (0.6) 66.87 (0.4) 68.69 (0.9) 72.47 (0.8) 72.98 (0.4)
+BN 58.43 (0.4) 66.29 (0.5) 68.76 (0.4) 71.96 (0.2) 73.43 (0.5)
+LN 59.95 (0.3) 65.26 (0.4) 68.68 (0.6) 72.52 (0.1) 73.01 (0.2)
+GN 58.60 (0.2) 66.11 (0.6) 67.57 (0.5) 72.01 (0.3) 73.10 (0.2)
+Dropout 58.44 (0.4) 66.76 (0.9) 68.74 (0.7) 72.12 (0.5) 72.58 (0.7)
+WD 59.23 (0.6) 66.11 (0.9) 68.41(0.8) 72.48 (0.6) 72.52 (0.3)
+KL-Norm 62.55 (1.3) 69.02 (0.6) 70.47 (0.3) 73.92 (0.2) 74.05 (0.7)
+2.48 +2.15 +1.78 +1.45 +1.07
MNLI
BERTBase 45.53 (1.1) 51.12 (0.9) 57.74 (0.8) 58.83 (0.3) 60.19 (0.7)
+BN 46.34 (0.4) 52.14 (1.3) 55.7 (0.8) 56.5 (0.5) 59.5 (0.7)
+LN 45.25 (1.3) 51.42 (1.3) 57.79 (0.4) 59.03 (0.6) 60.32 (0.7)
+GN 44.17 (1.8) 50.53 (1.1) 57.34 (0.6) 58.60 (0.3) 59.17 (0.5)
+Dropout 45.44 (0.8) 51.32 (1.1) 57.65 (1.1) 59.43 (0.4) 60.08 (0.9)
+WD 45.72 (0.7) 51.42 (0.4) 57.34 (0.7) 58.88 (0.6) 60.24 (0.4)
+KL-Norm 47.65 (0.7) 53.20 (1.4) 59.29 (1.2) 60.17 (0.9) 61.24 (0.8)
+2.12 +2.08 +1.55 +1.34 +1.05
QNLI
BERTBase 71.12 (0.6) 75.30 (0.4) 75.9 (0.8) 78.18 (0.2) 79.51 (0.4)
+BN 71.70 (0.2) 74.13 (0.5) 75.7 (0.3) 77.12 (0.4) 77.32 (0.4)
+LN 71.95 (0.4) 74.97 (0.8) 75.8 (0.3) 78.38 (0.2) 79.67 (0.4)
+GN 71.24 (0.4) 74.16 (0.6) 75.7 (0.2) 77.55 (0.1) 77.45 (0.3)
+Dropout 71.43 (1.2) 74.77 (0.4) 75.23 (0.7) 78.67 (0.5) 79.23 (0.3)
+WD 71.43 (0.3) 74.61 (0.6) 75.51 (0.2) 78.47 (0.9) 79.33 (0.4)
KL-Norm 73.20 (0.5) 76.82 (0.8) 76.97 (0.3) 79.12 (0.8) 80.34 (0.7)
+2.08 +1.52 +1.07 +0.94 +0.83
YELP
BERTBase 41.58 (0.3) 44.02 (0.5) 45.54 (0.5) 47.62 (0.9) 47.92 (0.7)
+BN 38.08 (0.3) 43.18 (0.7) 44.56 (0.3) 45.58 (0.3) 47.04 (0.8)
+LN 41.13 (0.4) 43.37 (0.4) 46.01 (0.6) 46.35 (0.5) 47.94 (0.8)
+GN 40.34 (0.7) 43.11 (0.6) 45.12 (0.2) 45.79 (0.4) 47.17 (0.3)
+Dropout 40.86 (0.3) 43.37 (0.4) 45.02 (0.9) 46.77 (0.7) 48.02 (0.2)
+WD 40.77 (0.9) 43.60 (1.1) 45.07 (0.6) 47.12 (1.1) 47.83 (0.9)
KL-Norm 41.48 (0.4) 43.86 (1.1) 46.10 (0.7) 48.38 (0.4) 48.58 (0.4)
-0.1 -0.16 +0.56 +0.76 +0.66
contrasting effects: the former tries to improve the quality
of the reconstruction, neglecting the shape of the latent
space; the KL-divergence, on the other hand, normalizes and
smooths the latent space. Tuning the β parameter is crucial
to reduce over-fitting. We analyze the effect of KL-Norm
on the generalization of the model and on reducing overfitting.
We analyze the effect of the β parameter on training and
validation error: we fix the bottleneck size K based on the
models selected in Section 3.6, train the KL-Norm model
on the GLUE benchmark for varying values of β, and plot the
validation and training losses in Figures 2 and 3.
KL-Norm has little effect for small values of β: the
validation loss is substantially higher than the training loss,
indicating over-fitting. This is because the network
becomes too deterministic (Σ → 0) and learns irrelevant
features not needed to predict the labels. As we increase
β, we see better generalization of the network. As β
becomes too large, the validation loss increases again, as the
normalization starts blocking relevant features needed to predict the labels.
3.9. Analysis of Removal of irrelevant features
We use the framework of [4, 10] to evaluate whether
debiasing methods succeed in removing biases from
the sentence representation. After debiasing, the trained encoder is frozen
and the classifier is retrained to try to extract the biases. If
the classifier reaches high accuracy given only bias features,
then the encoder's representation has not been successfully
debiased. We train a classifier which only sees the
Table 3. Test accuracies in the low-resource setting on speech classification under varying sizes of training data (300, 600, 900, 1200 and
1500 samples). ∆ shows the absolute difference between the results of the KL-Norm model and Wav2Vec-base.
Data Model 300 600 900 1200 1500
GoogleC
Wav2vec 63.21 (0.9) 83.15 (0.5) 84.32 (0.7) 85.58 (0.5) 86.92 (0.4)
+BN 64.70 (0.6) 84.99 (0.7) 86.81 (0.6) 87.33 (0.7) 88.17 (0.9)
+LN 63.86 (0.4) 84.11 (0.5) 85.68 (0.6) 86.99 (0.2) 87.73 (0.6)
+GN 63.14 (0.2) 84.03 (0.2) 85.16 (0.4) 86.45 (0.6) 87.01(0.1)
+Dropout 63.11 (0.3) 83.45 (0.7) 84.87 (0.4) 85.63 (0.7) 86.61 (0.4)
+WD 63.21 (0.8) 83.56(0.6) 84.93(0.5) 85.77 (0.8) 86.79 (0.4)
+KL-Norm 70.16 (0.9) 87.17 (0.7) 87.93 (0.8) 88.44 (0.4) 89.08 (0.3)
+6.95 +4.02 +3.61 +2.86 +2.16
ESD
Wav2Vec 55.23 (0.7) 62.02 (0.8) 72.56 (0.3) 75.01 (0.4) 77.12 (0.5)
+BN 56.82 (0.6) 62.93 (0.6) 73.67 (0.4) 75.43 (0.8) 77.88 (0.7)
+LN 56.36 (0.3) 62.58 (0.7) 73.11 (0.8) 75.19 (0.6) 77.16 (0.4)
+GN 55.08 (0.3) 62.11 (0.2) 72.36 (0.8) 74.79 (0.6) 76.52 (0.4)
+Dropout 54.15 (0.6) 61.66 (0.8) 71.56(0.7) 74.33 (0.2) 76.21 (0.4)
+WD 54.34 (0.7) 61.79 (0.6) 71.62 (0.4) 74.72 (0.9) 76.19 (0.8)
KL-Norm 59.62 (0.7) 65.86 (0.8) 75.67 (0.6) 77.13 (0.2) 78.53 (0.7)
+4.39 +3.84 +3.11 +2.12 +1.41
Crema
Wav2vec 52.12 (0.2) 57.13 (1.2) 59.74 (0.5) 63.76 (0.5) 65.67 (0.6)
+BN 52.07 (0.2) 58.03 (0.9) 59.76 (0.7) 64.62 (0.8) 65.26 (0.2)
+LN 51.38 (0.9) 57.29 (0.8) 59.16 (0.7) 64.01 (0.6) 64.94 (0.7)
+GN 51.05 (0.9) 57.12 (0.4) 59.21 (0.2) 63.42 (0.8) 63.79 (0.6)
+Dropout 50.61 (0.7) 56.88 (0.8) 57.03 (0.8) 62.67 (0.6) 63.08 (0.7)
+WD 50.88 (0.7) 56.91 (0.7) 57.28 (0.6) 62.84 (0.2) 63.19 (0.4)
+KL-Norm 55.45 (0.8) 58.95 (1.3) 61.21 (0.9) 64.89 (0.8) 66.57 (0.6)
+3.33 +1.82 +1.47 +1.13 +0.90
Table 4. Test accuracy of models transferring to new target datasets. All models are trained on SNLI or MNLI and tested on the target datasets.
∆ are absolute differences with BERTBase.
SNLI MNLI
Data BERTBase +BN +LN +GN +KL-Norm ∆ BERTBase +BN +LN +GN +KL-Norm ∆
JOCI 46.03 47.13 44.67 45.55 51.82 +5.79 46.41 48.12 45.21 44.31 54.11 +7.7
ADD1 45.61 38.75 44.7 39.81 44.14 -1.47 51.22 35.65 46.42 47.21 56.67 +5.45
DPR 49.22 49.12 49.18 49.11 50.31 +1.09 49.95 49.31 49.4 49.3 50.11 +0.16
SPR 37.07 35.43 37.48 35.55 36.12 +1.7 42.37 40.45 45.12 42.11 43.17 +0.80
FN+ 45.31 50.61 44.31 47.71 47.35 -0.95 43.5 43.2 43.53 43.3 44.88 +1.38
SICK 53.06 46.78 53.98 45.11 55.61 +2.55 65.07 62.43 66.60 63.33 71.64 +6.53
MPE 58.12 57.21 57.44 54.33 63.61 +5.48 55.10 54.43 57.5 56.48 58.31 +3.21
SCITAIL 64.81 58.13 65.23 58.43 70.88 +6.07 64.67 67.49 66.61 64.31 75.43 +10.76
QQP 62.91 60.93 59.88 58.07 65.87 +2.96 63.38 60.16 62.89 60.11 69.67 +6.29
SNLI Hard 64.39 63.12 65.37 63.31 67.45 +3.06 54.11 53.78 54.38 53.55 56.91 +2.8
Average +2.62 +4.50
representation of the hypothesis sentence and check whether it can predict the
class of the sentence pair, which is an established criterion to
measure known biases in NLI datasets [17]. Thus, we freeze
the trained encoders from our model and from the BERT baseline
and retrain a hypothesis-only classifier on hypotheses from
the SNLI and MNLI datasets. For reference, we compare to
a hypothesis-only model with a BERT encoder trained end-
to-end. Table 5 shows that KL-Norm
achieves lower accuracy than all baseline models.
Table 5. Hypothesis-only accuracy when freezing the encoder
from models trained on SNLI/MNLI in Table 2 and retraining a
hypothesis-only classifier, and baseline results when the encoder is
not frozen (H-only). Lower results show more successful debiasing.
Model SNLI MNLI
Train Dev Test Train Dev Test
H-only 95.2 56.62 55.78 83.12 51.34 51.12
BERTBase 73.2 51.87 51.17 52.7 42.68 43.55
+BN 71.1 51.91 51.11 58.5 44.68 44.03
+LN 70.9 51.98 52.26 58.5 44.68 44.03
+GN 71.9 52.75 53.08 58.7 44.91 44.72
+KL-Norm 49.5 41.41 40.12 37.4 35.28 35.99
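A hedged sketch of this probing protocol (our own illustration; the dataset loading, the stand-in encoder and the label set are assumptions): freeze the trained encoder and retrain only a linear classifier on hypothesis-only inputs.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # stand-in for the trained encoder

for p in encoder.parameters():          # freeze the encoder; only the probe is trained
    p.requires_grad = False

probe = nn.Linear(768, 3)               # hypothesis-only classifier over the 3 NLI labels
optimizer = torch.optim.Adam(probe.parameters(), lr=2e-5)

def probe_step(hypotheses, labels):
    """One training step of the hypothesis-only probe (premises are never shown)."""
    batch = tokenizer(hypotheses, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        pooled = encoder(**batch).last_hidden_state[:, 0]   # [CLS] representation
    loss = nn.functional.cross_entropy(probe(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(probe_step(["A man is sleeping."], torch.tensor([1])))
```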
3.10. Analysis of model parameters
Table 6 shows the efficiency evaluation of the proposed
model in terms of number of model parameters and memory
overhead with K = 512. The peak memory overhead in-
creases by 1.92% relative to all other baseline models. KL-Norm
has substantially lower memory overhead as compared to
weight decay, and remains practical when dealing with large
scale transformer models such as BERT, RoBERTa, etc. There
is a negligible increase (1.68%) in model parameters due
to the addition of the MLP layers that calculate the affine parameters,
i.e. shift and scale.
Figure 2. Validation and training losses (CE loss) of KL-Norm for varying β and a fixed bottleneck size on GLUE (MRPC).
3.11. Ablation Study
Analysis of the model without KL loss
Table 7 shows the eval-
uation on GLUE datasets without the regularization loss (β = 0).
The architecture then reduces to deterministic dimen-
sionality reduction with an MLP. This evaluation shows the
performance gained by adding the regularization loss.
Analysis of the model in the high resource setting
We have run
experiments in the high resource setting with two NLP
Figure 3. Validation and training losses (CE loss) of KL-Norm for varying β and a fixed bottleneck size on GLUE (RTE).
Table 6. Performance evaluation for all methods. ∆% are relative
differences with BERTBase.
Model Memory ∆% #Parameters ∆%
BERTBase 418.74 GB 109.48 M
+BN 418.74 GB 0% 109.48 M 0%
+LN 418.74 GB 0% 109.48 M 0%
+GN 418.74 GB 0% 109.48 M 0%
+WD 506.19 GB 20.88% 109.48 M 0%
+Dropout 418.74 GB 0% 109.48 M 0%
+KL-Norm 426.80 GB 1.92 % 111.33 M 1.68%
Table 7. Average ablation results over 5 runs with std in parentheses
on GLUE.
MRPC RTE
Model Accuracy F1 Accuracy
BERTBase 84.31 (0.2) 89.01 (0.2) 64.23 (1.8)
+BN 86.76 (0.5) 90.49 (0.5) 66.42 (1.4)
+LN 85.23 (0.3) 88.12 (0.7) 65.17 (0.9)
+GN 85.01 (0.2) 87.98 (0.5) 65.55 (0.2)
+KL-Norm (β=0) 86.27 (0.4) 90.03 (0.3) 67.17 (0.8)
+KL-Norm 87.25 (0.1) 91.03 (0.6) 70.42 (0.8)
datasets (MNLI and QNLI) and one speech dataset (Google
Speech Commands) to study the behaviour of the proposed
model. We find the results comparable with the other nor-
malization techniques, as shown in Table 8. The reason is that in
higher resource settings the learnable parameters of traditional
normalization techniques are already able to make the data well
behaved, so the proposed KL regularized normalization brings no additional benefit.
4. Conclusion
In this paper, we have proposed a novel KL Regular-
ized normalization framework, KL-Norm, that calculates the
rescaling parameters of normalization along with imposing
Table 8. Average accuracy over 5 runs with std in parentheses in
the high resource setting.
Model MNLI QNLI GoogleC
BERTBase 82.92(0.24) 90.12(0.12) -
+BN 82.23 (0.21) 89.14(0.23) -
+LN 82.98 (0.18) 90.34 (0.14) -
+GN 82.01 (0.13) 90.01 (0.15) -
+KL-Norm 82.27 (0.28) 90.03 (0.09) -
Wav2Vec - - 94.78(0.12)
+BN - - 93.12(0.13)
+LN - - 94.84(0.17)
+GN - - 94.52(0.31)
+KL-Norm - - 94.29(0.21)
the Gaussian prior through the KL loss. It incorporates stochastic-
ity, which gives an ensemble effect that helps the model achieve
higher accuracy. The added KL loss acts as a regularizer that
reduces overfitting and improves generalization, and it removes
irrelevant features of the data. The approach is based on density
estimation, which performs better on out of domain generaliza-
tion. Experimental evaluations on low resource downstream
NLP tasks using pre-trained BERT and on speech tasks using the
pre-trained Wav2Vec 2.0 model demonstrate superior perfor-
mance of the proposed framework against baseline models.
References
[1]
Andrea Asperti and Matteo Trentin. Balancing
reconstruction error and kullback-leibler divergence
in variational autoencoders, 2020.
[2]
Jimmy Ba, J. Kiros, and Geoffrey E. Hinton. Layer
normalization. ArXiv, abs/1607.06450, 2016.
[3]
Alexei Baevski, Henry Zhou, Abdel rahman Mohamed,
and Michael Auli. wav2vec 2.0: A framework for
self-supervised learning of speech representations.
ArXiv, abs/2006.11477, 2020.
[4]
Yonatan Belinkov, Adam Poliak, S. Shieber, Ben-
jamin Van Durme, and Alexander M. Rush. On
adversarial removal of hypothesis-only bias in natural
language inference. In *SEMEVAL, 2019.
[5]
Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew
Dai, Rafal Jozefowicz, and Samy Bengio. Generating
sentences from a continuous space. In CoNLL, 2016.
[6]
Pratik Chaudhari, Anna Choromanska, Stefano Soatto,
Yann LeCun, Carlo Baldassi, Christian Borgs, Jen-
nifer T. Chayes, Levent Sagun, and Riccardo Zecchina.
Entropy-SGD: Biasing Gradient Descent Into Wide
Valleys. In 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April
24-26, 2017, Conference Track Proceedings, 2017.
[7]
Tim Cooijmans, Nicolas Ballas, César Laurent, and
Aaron C. Courville. Recurrent batch normalization.
ArXiv, abs/1603.09025, 2017.
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding.
In NAACL, 2019.
[9]
Terrance Devries and Graham W. Taylor. Learning
confidence for out-of-distribution detection in neural
networks. ArXiv, abs/1802.04865, 2018.
[10]
Yanai Elazar and Yoav Goldberg. Adversarial removal
of demographic attributes from text data. In EMNLP,
2018.
[11]
Zhiyun Fan, Meng Li, Shiyu Zhou, and Bo Xu.
Exploring wav2vec 2.0 on speaker verification and
language identification. ArXiv, abs/2012.06185, 2021.
[12]
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan,
Pascal Germain, Hugo Larochelle, François Laviolette,
Mario Marchand, and Victor Lempitsky. Domain-
adversarial Training of Neural Networks. Journal of Ma-
chine Learning Research, 17(1):2096–2030, Jan. 2016.
[13]
Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi,
Mohsin Ali, Jongseo Lee, Matthias Humt, Jianxiang
Feng, Anna Kruspe, R. Triebel, P. Jung, R. Roscher, M.
Shahzad, Wen Yang, R. Bamler, and Xiaoxiang Zhu.
A survey of uncertainty in deep neural networks. ArXiv,
abs/2107.03342, 2021.
[14]
Yichen Gong, Heng Luo, and Jian Zhang. Natural lan-
guage inference over interaction space. In ICLR, 2017.
[15]
Ian Goodfellow, Yoshua Bengio, and Aaron
Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
[16]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza,
Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, and Yoshua Bengio. Generative Adversarial
Nets. In Advances in Neural Information Processing
Systems 27, pages 2672–2680. 2014.
[17]
Suchin Gururangan, Swabha Swayamdipta, Omer Levy,
Roy Schwartz, Samuel Bowman, and Noah A. Smith.
Annotation artifacts in natural language inference data.
In NAACL, 2018.
[18]
Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catan-
zaro, Gregory Frederick Diamos, Erich Elsen, Ryan J.
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam
Coates, and A. Ng. Deep speech: Scaling up end-to-end
speech recognition. ArXiv, abs/1412.5567, 2014.
[19]
Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Visualizing
and Understanding the Effectiveness of BERT. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 4141–4150,
Hong Kong, China, Nov. 2019. Association for
Computational Linguistics.
[20]
X. Huang and Serge J. Belongie. Arbitrary style transfer
in real-time with adaptive instance normalization. 2017
IEEE International Conference on Computer Vision
(ICCV), pages 1510–1519, 2017.
[21]
S. Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal
covariate shift. ArXiv, abs/1502.03167, 2015.
[22]
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov,
Dmitry Vetrov, and Andrew Gordon Wilson. Averaging
weights leads to wider optima and better generalization.
In Ricardo Silva, Amir Globerson, and Amir Globerson,
editors, 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, 34th Conference on
Uncertainty in Artificial Intelligence 2018, UAI
2018, pages 876–885. Association For Uncertainty in
Artificial Intelligence (AUAI), 2018.
[23]
Songhao Jia, Ding-Jie Chen, and Hwann-Tzong
Chen. Instance-level meta normalization. 2019
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4860–4868, 2019.
[24]
Tushar Khot, Ashish Sabharwal, and Peter Clark.
Scitail: A textual entailment dataset from science
question answering. In AAAI, 2018.
[25]
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang.
Bilinear attention networks. In NeurIPS, 2018.
[26]
Taesup Kim, Inchul Song, and Yoshua Bengio. Dynamic
layer normalization for adaptive neural acoustic model-
ing in speech recognition. ArXiv, abs/1707.06065, 2017.
[27]
M. Kliger and S. Fleishman. Novelty detection with
gan. ArXiv, abs/1802.10560, 2018.
[28]
Zhi Kou, Kaichao You, Mingsheng Long, and Jianmin
Wang. Stochastic normalization. In NeurIPS, 2020.
[29]
Anders Krogh and John A Hertz. A simple weight
decay can improve generalization. In NeurIPS, 1992.
[30]
Alice Lai, Yonatan Bisk, and Julia Hockenmaier.
Natural language inference from multiple premises. In
IJCNLP, 2017.
[31]
Balaji Lakshminarayanan, A. Pritzel, and C. Blundell.
Simple and scalable predictive uncertainty estimation
using deep ensembles. In NIPS, 2017.
[32]
Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang.
Mixout: Effective regularization to finetune large-scale
pretrained language models. In ICLR, 2019.
[33]
Ping Luo, Ruimao Zhang, Jiamin Ren, Z. Peng, and J.
Li. Switchable normalization for learning-to-normalize
deep representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 43:712–728, 2021.
[34]
Karimi Rabeeh Mahabadi, Yonatan Belinkov, and
James Henderson. End-to-end bias mitigation by
modelling biases in corpora. In ACL, 2020.
[35]
Marco Marelli, Stefano Menini, Marco Baroni, Luisa
Bentivogli, Raffaella Bernardi, and Roberto Zamparelli.
A sick cure for the evaluation of compositional
distributional semantic models. In LREC, 2014.
[36]
Marius Mosbach, Maksym Andriushchenko, and
Dietrich Klakow. On the stability of fine-tuning bert:
Misconceptions, explanations, and strong baselines.
ICLR, 2021.
[37]
Hyeonseob Nam and Hyo-Eun Kim. Batch-instance
normalization for adaptively style-invariant neural
networks. In NeurIPS, 2018.
[38]
Y. Nesterov. Introductory lectures on convex optimiza-
tion - a basic course. In Applied Optimization, 2004.
[39]
Abhishek Niranjan, Mukesh C. Sharma, Sai
Bharath Chandra Gutha, and M. Ali Basha Shaik.
End-to-end whisper to natural speech conversion using
modified transformer network. 2021.
[40]
Yaniv Ovadia, E. Fertig, J. Ren, Zachary Nado, D.
Sculley, S. Nowozin, Joshua V. Dillon, Balaji Laksh-
minarayanan, and Jasper Snoek. Can you trust your
model’s uncertainty? evaluating predictive uncertainty
under dataset shift. In NeurIPS, 2019.
[41]
Taesung Park, Ming-Yu Liu, Ting-Chun Wang,
and Jun-Yan Zhu. Semantic image synthesis with
spatially-adaptive normalization. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, 2019.
[42]
Ellie Pavlick and Chris Callison-Burch. Most “babies”
are “little” and most “problems” are “huge”: Compo-
sitional entailment in adjective-nouns. In ACL, 2016.
[43]
Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi,
Chris Callison-Burch, Mark Dredze, and Benjamin
Van Durme. Framenet+: Fast paraphrastic tripling of
framenet. In ACL, 2015.
[44]
Alec Radford, Jeff Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. Language models
are unsupervised multitask learners. 2019.
[45]
Altaf Rahman and Vincent Ng. Resolving complex
cases of definite pronouns: the winograd schema
challenge. In EMNLP, 2012.
[46]
Drew Reisinger, Rachel Rudinger, Francis Ferraro,
Craig Harman, Kyle Rawlins, and Benjamin Van Durme.
Semantic proto-roles. In TACL, 2015.
[47]
Tim Salimans and Diederik P. Kingma. Weight nor-
malization: A simple reparameterization to accelerate
training of deep neural networks. In NIPS, 2016.
[48]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,
Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a
simple way to prevent neural networks from overfitting.
In JMLR, 2014.
[49]
Songbo Tan and Jin Zhang. An Empirical Study of Senti-
ment Analysis for Chinese Documents. Expert Systems
with Applications, 34(4):2622–2629, May 2008.
[50]
D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance nor-
malization: The missing ingredient for fast stylization.
ArXiv, abs/1607.08022, 2016.
[51]
Zhiguo Wang, Wael Hamza, and Radu Florian. Bilat-
eral multi-perspective matching for natural language
sentences. In IJCAI, 2017.
[52]
Aaron Steven White, Pushpendre Rastogi, Kevin Duh,
and Benjamin Van Durme. Inference is everything:
Recasting semantic resources into a unified evaluation
framework. In IJCNLP, 2017.
[53]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pierric
Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and
Jamie Brew. Huggingface’s transformers: State-of-the-
art natural language processing. arXiv:1910.03771,
2019.
[54]
Yuxin Wu and Kaiming He. Group normalization. In
ECCV, 2018.
[55]
Hemant Yadav, Akshat Gupta, Sai Krishna Rallabandi,
Alan W Black, and Rajiv Ratn Shah. Intent classifi-
cation using pre-trained embeddings for low resource
languages, 2021.
[56]
Sheng Zhang, Rachel Rudinger, Kevin Duh, and
Benjamin Van Durme. Ordinal common-sense
inference. In TACL, 2017.
[57]
Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q
Weinberger, and Yoav Artzi. Revisiting few-sample
bert fine-tuning. ICLR, 2021.
[58]
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and
Ziwei Liu. Learning to prompt for vision-language
models. Int. J. Comput. Vis., 130:2337–2348, 2022.
[59]
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin
Knight. Transfer Learning for Low-Resource Neural
Machine Translation. In Proceedings of the 2016
Conference on Empirical Methods in Natural Language
Processing, pages 1568–1575, Austin, Texas, Nov.
2016. Association for Computational Linguistics.