Transformer Uncertainty Estimation with Hierarchical Stochastic Attention

Jiahuan Pei1,2*, Cheng Wang2†, György Szarvas2
1University of Amsterdam
2Amazon Development Center Germany GmbH, Berlin, Germany
{jpei, cwngam, szarvasg}@amazon.de
Abstract
Transformers are state-of-the-art in a wide range of NLP
tasks and have also been applied to many real-world prod-
ucts. Understanding the reliability and certainty of trans-
former model predictions is crucial for building trustable
machine learning applications, e.g., medical diagnosis. Al-
though many recent transformer extensions have been pro-
posed, the study of the uncertainty estimation of transformer
models is under-explored. In this work, we propose a novel
way to enable transformers to have the capability of uncer-
tainty estimation and, meanwhile, retain the original predic-
tive performance. This is achieved by learning a hierarchi-
cal stochastic self-attention that attends to values and a set of
learnable centroids, respectively. Then new attention heads
are formed with a mixture of sampled centroids using the
Gumbel-Softmax trick. We theoretically show that the self-
attention approximation by sampling from a Gumbel distribu-
tion is upper bounded. We empirically evaluate our model on
two text classification tasks with both in-domain (ID) and out-
of-domain (OOD) datasets. The experimental results demon-
strate that our approach: (1) achieves the best predictive per-
formance and uncertainty trade-off among compared meth-
ods; (2) exhibits very competitive (in most cases, improved)
predictive performance on ID datasets; (3) is on par with
Monte Carlo dropout and ensemble methods in uncertainty
estimation on OOD datasets.
1 INTRODUCTION
Uncertainty estimation and quantification are important
tools for building trustworthy and reliable machine learning
systems (Lin, Engel, and Eslinger 2012; Kabir et al. 2018;
Riedmaier et al. 2021). They matter particularly when machine-
learned systems are applied to make predictions that involve
important decisions, e.g., medical diagnosis (Ghoshal
and Tucker 2020), financial planning and decision-making
(Baker et al. 2020; Oh and Hong 2021), and autonomous
driving (Hoel, Wolff, and Laine 2020). The recent devel-
opment of neural networks has shown excellent predictive
performance in many domains. Among those, transformers,
including the vanilla transformer (Vaswani et al. 2017) and
its variants such as BERT (Devlin et al. 2019; Wang et al.
*This work was done during an internship at Amazon.
†Corresponding author.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: The methods of uncertainty estimation. (a) A deterministic
neural network outputs a single-point prediction;
(b) a Bayesian neural network captures uncertainty via sampling
weights from a Gaussian distribution, W_ij = µ_ij + ε · σ_ij with
ε ~ Gaussian(0, 1); (c) variational dropout
captures uncertainty via sampling dropout masks from a
Bernoulli distribution, r_i ~ Bernoulli(1 − p) with p ∈ [0, 1];
(d) an ensemble captures uncertainty
by combining multiple independently trained deterministic
models with different random seeds; (e) the Gumbel-Softmax
trick for uncertainty estimation, where the randomness comes from
sampling a categorical distribution using Gumbel noise,
g_i = −log(−log u) with u ~ Uniform(0, 1).
2020) are the representative state-of-the-art type of neural
architectures that have shown remarkable performance on
various recent Natural Language Processing (NLP) (Gillioz
et al. 2020) and Information Retrieval (IR) (Ren et al. 2021)
tasks.
Although transformers excel in terms of predictive per-
formance (Tetko et al. 2020; Han et al. 2021), they do not
let practitioners inspect model confidence: because they are
deterministic, they cannot indicate whether they are confident
about their predictions. This limitation is non-trivial because
transformers are the cutting-edge base models for NLP. Estimating
the predictive uncertainty of transformers therefore helps to build
and examine model reliability for downstream tasks.
To estimate the uncertainty of neural models’ prediction,
one common way is to inject stochasticity (e.g., noise or ran-
domness) (Kabir et al. 2018; Gawlikowski et al. 2021). It
enables models to output a predictive distribution, instead
arXiv:2112.13776v1 [cs.CL] 27 Dec 2021
of a single-point prediction. Casting a deterministic trans-
former to be stochastic requires us to take the training and
inference computational complexity into consideration, because
uncertainty estimation usually relies on multiple forward runs.
Therefore, directly adapting the aforementioned
methods is undesirable, given the huge number of parameters
and the architectural complexity of transformers.
Figure 1 outlines the deterministic transformer (Figure 1(a))
and the possible approaches (Figure 1(b-e)) for making a
stochastic transformer. A BNN (Figure 1(b)) assumes the network
weights follow a Gaussian or a mixture of Gaussians
(Blundell et al. 2015), and tries to learn the weight distribution
(µ, σ), instead of the weight W itself, with the help of the
re-parameterization trick (Kingma and Welling 2014). This
means that a BNN doubles the number of parameters, which is
particularly challenging for a large network like a transformer,
which has millions of parameters to be optimized. To alleviate
this issue, MC dropout (Gal and Ghahramani 2016)
(Figure 1(c)) uses dropout (Srivastava et al. 2014), concretely
Bernoulli distributed random variables, to approximate
the exact posterior distribution (Gal and Ghahramani
2016). However, MC dropout tends to give overconfident
uncertainty estimates (Foong et al. 2019). An ensemble
(Lakshminarayanan, Pritzel, and Blundell 2017) (Figure 1(d)) is
an alternative way to model uncertainty by averaging N
independently trained models, which multiplies the computational
cost of model training by N.
Unlike the models above, we propose a simple yet effective
approach based on the Gumbel-Softmax trick, also known as the
Concrete distribution (Jang, Gu, and Poole 2017; Maddison, Mnih,
and Teh 2017), two independently proposed continuous
relaxations, to estimate the uncertainty of transformers.
First, we cast the deterministic attention distribution for val-
ues in each self-attention head to be stochastic. The at-
tention is then sampled from a Gumbel-Softmax distribu-
tion, which controls the concentration over values. Second,
we regularize the key heads in self-attention to attend to
a set of learnable centroids. This is equivalent to perform-
ing clustering over keys (Vyas, Katharopoulos, and Fleuret
2020) or clustering hidden states in RNN (Wang and Niepert
2019; Wang, Lawrence, and Niepert 2021). A similar attention
mechanism has also been used to let the layers in
the encoder and decoder attend to inputs in the Set Transformer
(Lee et al. 2019) and to estimate attentive matrices
in Capsule networks (Ahmed and Torresani 2019).
Third, each new key head will be formed with a mixture
of Gumbel-Softmax sampled centroids. The stochasticity
is injected by sampling from a Gumbel-Softmax distribution.
This differs from BNN (sampling from a Gaussian
distribution), MC-dropout (sampling from a Bernoulli
distribution), and Ensemble (where the stochasticity comes from
random seeds in model training). With this proposed mechanism,
we approximate the vanilla transformer with a stochastic
transformer based on a hierarchical stochastic self-attention,
namely H-STO-TRANS, which enables the sampling of at-
tention distributions over values as well as over a set of
learnable centroids.
Our work makes the following contributions:
• We propose a novel way to cast the self-attention in transformers
to be stochastic, which enables transformer models
to provide uncertainty information with predictions.
• We theoretically show that the proposed self-attention
approximation is upper bounded: key attention heads
that are close in Euclidean distance have similar attention
distributions over centroids.
• In two benchmark tasks for NLP, we empirically demonstrate
that H-STO-TRANS (1) achieves very competitive
(in most cases, better) predictive performance on in-domain
datasets; (2) is on par with baselines in uncertainty
estimation on out-of-domain datasets; (3) learns a
better predictive performance-uncertainty trade-off than
the compared baselines, i.e., high predictive performance
and low uncertainty on in-domain datasets, and high predictive
performance and high uncertainty on out-of-domain
datasets.
Figure 2: The illustration of multi-head self-attention in
deterministic and stochastic transformers (components shown:
Embedding, Query, Key, Value, Attention, Centroids). (a) The vanilla
transformer with deterministic self-attention. (b) The stochastic
transformer has stochastic self-attention used to weight values
V; the standard Softmax is replaced with the Gumbel-Softmax.
(c) The hierarchical stochastic transformer learns to
pay attention to values V and a set of learnable centroids
C stochastically.
2 BACKGROUND
Predictive Uncertainty
Predictive uncertainty estimation is a challenging and
unsolved problem. Uncertainty can be categorized in several
ways. Commonly it is divided into epistemic
(model) and aleatoric (data) uncertainty (Der Kiureghian
and Ditlevsen 2009; Kendall and Gal 2017). Alternatively,
on the basis of the input data domain, it can be divided
into in-domain (ID) (Ashukha et al. 2019) and out-of-domain
(OOD) uncertainty (Hendrycks and Gimpel 2017;
Wang and Van Hoof 2020). With in-domain data, i.e., when the
input data distribution is similar to the training data
distribution, a reliable model should exhibit high predictive
performance (e.g., high accuracy or F1-score) and report high
confidence (low uncertainty) on correct predictions. In
contrast, out-of-domain data has a quite different distribution
from the training data; an ideal model should still give high
predictive performance, showing generalization to the unseen
data distribution, but should be unconfident (high
uncertainty). We discuss the epistemic (model) uncertainty in the
context of ID and OOD scenarios in this work.
Vanilla Transformer
The vanilla transformer (Vaswani et al. 2017) is an alternative
architecture to recurrent neural networks (RNNs)
for modelling sequential data that
relaxes the model’s reliance on input sequence order. It con-
sists of multiple components such as positional embedding,
residual connection and multi-head scaled dot-product at-
tention. The core component of the transformer is the multi-
head self-attention mechanism.
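Let $x \in \mathbb{R}^{l \times d}$ ($l$ is the sequence length, $d$ is the dimension) be
the input data, and $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ be the matrices for
query $Q \in \mathbb{R}^{l \times h \times d_h}$, key $K \in \mathbb{R}^{l \times h \times d_h}$, and value
$V \in \mathbb{R}^{l \times h \times d_h}$, where $d_h = d/h$ and $h$ is the number of attention heads.
Each $x$ is associated with a query $Q$ and a key-value pair
$(K, V)$. The computation of an attentive representation $A$ of
$x$ in the multi-head self-attention is:
$$Q = W_q x; \quad K = W_k x; \quad V = W_v x \tag{1}$$
$$A = \mathrm{SOFTMAX}(\alpha^{-1} Q K^\top); \quad H = AV \tag{2}$$
where $H = [h_1, ..., h_h]$ is the multi-head output, $A =
[a_1, ..., a_h]$ is the attention distribution that attends to
$V$, and $\alpha$ is a scaling factor. Note that a large value of $\alpha$ pushes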
the Softmax function into regions where it has extremely
small gradients. This attention mechanism is the key factor
of a transformer for achieving high computational efficiency
and excellent predictive performance. However, as we can
see, all computation paths in this self-attention mechanism
are deterministic, leading to a single-point output. This
prevents us from accessing and evaluating uncertainty
information beyond the prediction for a given input x.
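As an illustration of Eqs. (1)-(2), the following is a minimal NumPy sketch of deterministic multi-head self-attention. It is not the authors' code: the function name `multi_head_self_attention`, the row-vector convention `x @ W` (instead of the $W x$ written above), and the default scaling $\alpha = \sqrt{d_h}$ are assumptions made for this example.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, W_q, W_k, W_v, h, alpha=None):
    """Deterministic attention: returns A (h, l, l) and H (h, l, d_h)."""
    l, d = x.shape
    d_h = d // h
    if alpha is None:
        alpha = np.sqrt(d_h)  # assumed default scaling factor

    def split_heads(W):
        # Project the input, then split the feature axis into h heads.
        return (x @ W).reshape(l, h, d_h).transpose(1, 0, 2)

    Q, K, V = split_heads(W_q), split_heads(W_k), split_heads(W_v)
    A = softmax(Q @ K.transpose(0, 2, 1) / alpha)  # Eq. (2), per head
    H = A @ V
    return A, H


# Toy usage with random weights (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
A, H = multi_head_self_attention(x, W_q, W_k, W_v, h=4)
A_again, _ = multi_head_self_attention(x, W_q, W_k, W_v, h=4)
```

Calling the function twice on the same input returns identical attention maps, which is the single-point behaviour the paragraph above describes.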
We argue that the examination of the reliability and con-
fidence of a transformer prediction is crucial for many NLP
applications, particularly when the output of a model is di-
rectly used to serve customer requests. In the following sec-
tion, we introduce a simple yet efficient way to cast the de-
terministic attention to be stochastic for uncertainty estima-
tion based on Gumbel-Softmax tricks (Jang, Gu, and Poole
2017; Maddison, Mnih, and Teh 2017).
3 METHODOLOGY
Bayesian Inference and Uncertainty Modeling
In this work, we focus on using transformers in classification
tasks. Let $D = \{X, Y\} = \{x_i, y_i\}_{i=1}^{N}$ be a training dataset, where
$y_i \in \{1, ..., M\}$ is the categorical label for an input $x_i \in \mathbb{R}^d$.
The goal is to learn a transformation function $f$, which is
parameterized by weights $\omega$ and maps a given input $x$ to a
categorical distribution $y$. The learning objective is to minimize
the negative log likelihood, $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i | x_i, \omega)$.
The probability distribution is obtained by the Softmax function
as:
$$p(y_i = m | x_i, \omega) = \frac{\exp(f_m(x_i, \omega))}{\sum_{k \in M} \exp(f_k(x_i, \omega))} \tag{3}$$
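The Softmax of Eq. (3) and the negative log likelihood objective can be sketched in a few lines of plain Python. This is an illustrative example only: the function names and toy logits are assumptions, and labels are 0-indexed here whereas the text uses $\{1, ..., M\}$.

```python
import math


def softmax(logits):
    """Eq. (3): turn a list of class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch_logits, labels):
    """Mean of -log p(y_i | x_i) over a batch; labels are 0-indexed."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Toy scores for a 3-class problem (illustrative values only).
probs = softmax([2.0, 1.0, 0.1])
loss = negative_log_likelihood([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]], [0, 2])
```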
In the inference phase, given a test sample $x^*$, the predictive
probability of $y^*$ is computed by:
$$p(y^* | x^*, D) = \int p(y^* | x^*, \omega) \, p(\omega | D) \, d\omega \tag{4}$$
where the posterior $p(\omega | D)$ is intractable and cannot be
computed analytically. A variational posterior distribution
$q_\theta(\omega)$, where $\theta$ are the variational parameters, is used to
approximate the true posterior distribution by minimizing the
Kullback-Leibler (KL) divergence. This can also be treated as
the maximization of the evidence lower bound (ELBO):
$$\mathcal{L}_\theta = \int q_\theta(\omega) \log p(Y | X, \omega) \, d\omega - \mathrm{KL}[q_\theta(\omega) \,\|\, p(\omega)] \tag{5}$$
With the re-parametrization trick (Kingma, Salimans, and
Welling 2015), a differentiable mini-batched Monte Carlo
estimator can be obtained.
The predictive (epistemic) uncertainty can be measured
by performing $T$ inference runs and averaging the predictions:
$$p(y^* | x^*) = \frac{1}{T} \sum_{t=1}^{T} p_{\omega_t}(y^* | x^*, \omega_t) \tag{6}$$
$T$ corresponds to the number of sets of mask vectors from the
Bernoulli distribution $\{r^t\}_{t=1}^{T}$ in MC-dropout, or the number
of randomly trained models in Ensemble, which
potentially leads to different sets of learned parameters $\omega =
\{\omega_1, ..., \omega_T\}$, or the number of sets of sampled attention
distributions from the Gumbel distribution $\{g^t\}_{t=1}^{T}$ in our proposed
method.
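The averaging in Eq. (6) can be sketched as follows. The helper `mc_predict` and the toy model `toy_stochastic_forward` (fixed logits perturbed by Gumbel noise) are hypothetical stand-ins for any stochastic forward pass; they are not part of the paper.

```python
import numpy as np


def mc_predict(stochastic_forward, x, T=10):
    """Run T stochastic forward passes; return mean and std per class."""
    samples = np.stack([stochastic_forward(x) for _ in range(T)])  # (T, M)
    return samples.mean(axis=0), samples.std(axis=0)


rng = np.random.default_rng(0)


def toy_stochastic_forward(logits):
    """Hypothetical stochastic model: softmax over Gumbel-perturbed logits."""
    noisy = logits + rng.gumbel(size=logits.shape)
    e = np.exp(noisy - noisy.max())
    return e / e.sum()


mean_probs, std_probs = mc_predict(
    toy_stochastic_forward, np.array([2.0, 0.5, -1.0]), T=10
)
```

The mean plays the role of the prediction in Eq. (6), and the standard deviation across the T runs is one simple way to quantify the uncertainty.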
Stochastic Self-Attention with Gumbel-Softmax
As described in Section 2, the core component that makes a
transformer successful is the multi-head self-attention. For
the $i$-th head, let $q_i \in Q$, $k_i \in K$, $v_i \in V$; it is written as:
$$a_i = \mathrm{SOFTMAX}\left(\frac{q_i k_i^\top}{\tau}\right); \quad a_i \in \mathbb{R}^{l \times l} \tag{7}$$
$$h_i = a_i v_i; \quad h_i \in \mathbb{R}^{l \times d_h} \tag{8}$$
We here use a temperature parameter $\tau$ to replace the scaling
factor $\alpha$. $a_i$ is the attention distribution, which learns the
compatibility scores between tokens in the sequence with
the $i$-th attention head. The scores are used to retrieve and
form a mixture of the content of the values, which is a kind of
content-based addressing mechanism, as in the neural Turing
machine (Graves, Wayne, and Danihelka 2014). Note that this
attention is deterministic.
A straightforward way to inject stochasticity is to replace the
standard Softmax with the Gumbel-Softmax, which
samples attention weights to form $\hat{a}_i$:
$$\hat{a}_i \sim \mathcal{G}\left(\frac{q_i k_i^\top}{\tau}\right) \tag{9}$$
$$h_i = \hat{a}_i v_i \tag{10}$$
where $\mathcal{G}$ is the GUMBEL-SOFTMAX function. The Gumbel-Softmax
trick is an instance of a path-wise Monte-Carlo gradient
estimator (Gumbel 1954; Maddison, Mnih, and Teh
2017; Jang, Gu, and Poole 2017). With the Gumbel trick,
we can draw samples $z$ from a categorical distribution given
by parameters $\theta$, that is, $z = \mathrm{ONE\_HOT}(\arg\max_i [g_i +
\log \theta_i])$, $i \in [1 \ldots k]$, where $k$ is the number of categories
and $g_i$ are i.i.d. samples from GUMBEL(0, 1), that is,
$g = -\log(-\log(u))$, $u \sim \mathrm{UNIFORM}(0, 1)$, which is independent
of the network parameters. Because the $\arg\max$ operator
breaks end-to-end differentiability, the categorical sample
$z$ can be approximated using the differentiable Softmax
function (Jang, Gu, and Poole 2017; Maddison, Mnih,
and Teh 2017). Here $\tau$ is a tunable temperature parameter
equivalent to $\alpha$ in Eq. (2). Then the attention weights
(scores) for values in Eq. (2) can be computed as:
$$\hat{a}_i = \frac{\exp((\log(\theta_i) + g_i)/\tau)}{\sum_{j=1}^{k} \exp((\log(\theta_j) + g_j)/\tau)}, \quad i \in [1 \ldots k] \tag{11}$$
where $\theta_i = q_i k_i^\top$. We use the following approximation:
$$\mathrm{KL}[a \,\|\, \hat{a}], \quad \text{where } a_j = \frac{a_j}{\sum_{i=1}^{k} a_i} \tag{12}$$
This indicates an approximation of a deterministic attention
distribution $a$ with a stochastic attention distribution $\hat{a}$.
With a larger $\tau$, the distribution of attention is more uniform,
and with a smaller $\tau$, the attention becomes more sparse.
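A small standard-library sketch of the relaxed sampling in Eq. (11) is given below. The function names, the toy weights `theta`, and the clamping of `u` away from 0 and 1 are assumptions for illustration; the weights are taken to be strictly positive so that $\log \theta_i$ is defined.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(theta, tau, rng):
    """Relaxed categorical sample as in Eq. (11); theta must be positive."""
    scores = [(math.log(t) + sample_gumbel(rng)) / tau for t in theta]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta = [0.2, 0.5, 0.3]  # toy category weights (illustrative only)
sample = gumbel_softmax(theta, tau=1.0, rng=rng)
flat_sample = gumbel_softmax(theta, tau=100.0, rng=rng)
```

With a large temperature such as `tau=100.0`, the returned weights stay close to uniform, in line with the remark above about larger $\tau$.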
The trade-off between predictive performance and uncertainty
estimation. This trade-off is rooted in the bias-variance
trade-off. Let $\phi(x)$ be a prediction function,
$f(x)$ be the true function and $\rho$ be a constant. The
error can be computed as:
$$\xi(x) = \underbrace{(\mathbb{E}[\phi(x) - f(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\phi(x) - \mathbb{E}[\phi(x)])^2]}_{\text{Variance}} + \underbrace{\rho}_{\text{Const}} \tag{13}$$
MC-dropout (Gal and Ghahramani 2016) with $T$ rounds of
Monte Carlo estimation gives a prediction $\mathbb{E}[\phi_t(x)]$, $t \in T$,
and a predictive uncertainty, e.g., the variance $\mathrm{Variance}[\phi_t(x)]$
($\rho$ is a constant that denotes the irreducible error). On both
in-domain and out-of-domain datasets, a good model should
exhibit low bias, which ensures model generalization
capability and high predictive performance. For epistemic
(model) uncertainty, we expect the model to output low variance
on in-domain data and high variance on out-of-domain data.
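The decomposition in Eq. (13) can be checked numerically on a handful of sampled predictions. The sketch below assumes a noise-free target, i.e. $\rho = 0$, and uses made-up prediction values; `bias_variance` is a hypothetical helper name.

```python
def bias_variance(predictions, target):
    """Return (bias^2, variance, mean squared error) for sampled predictions."""
    n = len(predictions)
    mean_pred = sum(predictions) / n
    bias_sq = (mean_pred - target) ** 2
    variance = sum((p - mean_pred) ** 2 for p in predictions) / n
    mse = sum((p - target) ** 2 for p in predictions) / n
    return bias_sq, variance, mse


# Four stochastic predictions of a quantity whose true value is 1.0
# (illustrative numbers only).
bias_sq, variance, mse = bias_variance([0.80, 0.70, 0.90, 0.75], target=1.0)
```

For these values the squared bias and the variance add up to the mean squared error, so a model can trade one against the other at a fixed error level.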
We empirically observe (from Table 1 and Table 2) that
this simple modification in Eq. (9) can effectively capture
the model uncertainty, but it struggles to learn a good trade-
off between predictive performance and uncertainty estima-
tion. That is, when good uncertainty estimation performance
is achieved on out-of-domain data, the predictive perfor-
mance on in-domain data degrades. To address this issue, we
propose a hierarchical stochastic self-attention mechanism.
Hierarchical Stochastic Self-Attention
To further encourage the transformer model to have stochasticity
while retaining predictive performance, we propose to add
an additional stochastic attention before the attention over
values. This attention forces each key head to stochastically
attend to a set of learnable centroids, which are
learned during back-propagation. This is equivalent to
regularizing key attention heads. Similar ideas have been used to
improve transformer efficiency (Vyas, Katharopoulos, and
Fleuret 2020) and to improve RNN memorization (Wang
and Niepert 2019).
We first define the set of $c$ centroids, $C \in \mathbb{R}^{d_h \times c}$. Let
each centroid $c_i \in \mathbb{R}^{d_h}$ have the same dimension as each
key head $k_j \in \mathbb{R}^{d_h}$. The model first learns to pay attention
to the centroids, and a new key head $\hat{k}_j$ is formed by
weighting each centroid. Then $\hat{k}$ and a query $q$ decide the
attention weights used to combine values $v$. For the $i$-th head, with a
given query $q_i$, key $k_i$, and value $v_i$, the stochastic self-attention
can be hierarchically formulated as:
$$\hat{a}_c \sim \mathcal{G}(\tau_1^{-1} k_i C), \quad \hat{a}_c \in \mathbb{R}^{l \times c} \tag{14}$$
$$\hat{k}_i = \hat{a}_c C^\top, \quad \hat{k}_i \in \mathbb{R}^{l \times d_h} \tag{15}$$
$$\hat{a}_v \sim \mathcal{G}(\tau_2^{-1} q_i \hat{k}_i^\top), \quad \hat{a}_v \in \mathbb{R}^{l \times l} \tag{16}$$
$$h_i = \hat{a}_v v_i \tag{17}$$
$\hat{a}_c$ and $\hat{a}_v$ are the sampled categorical distributions that are used
to weight the centroids in $C$ and the tokens in $v_i$. $\tau_1$ and $\tau_2$ control
the softness of each stochastic self-attention, respectively.
We summarize the main procedures of performing hierarchical
stochastic attention in the transformer in Algorithm 1.
Algorithm 1: Hierarchical stochastic transformer.
Input: query $Q$, key $K$, value $V$, centroids $C$
Output: hierarchical stochastic attentive output $H$
1 Model stochastic attention $\hat{A}_c$ over centroids $C$ as in Eq. (14);
2 Sample $\hat{A}_c$ from a categorical distribution
  $z = \mathrm{ONE\_HOT}(\arg\max_i [g_i + \log \theta_i])$, $i \in [1 \ldots k]$,
  $g = -\log(-\log(u))$, $u \sim \mathrm{UNIFORM}(0, 1)$;
3 Differentiably approximate $\hat{A}_c$ as in Eq. (11);
4 Compute $\hat{K} = \hat{A}_c C^\top$ as in Eq. (15);
5 Model stochastic attention $\hat{A}_v$ over value $V$ as in Eq. (16);
6 Sample and approximate $\hat{A}_v$, similar to lines 2 to 3;
7 Compute $H = \hat{A}_v V$ as in Eq. (17);
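The steps of Algorithm 1 can be sketched for a single head in NumPy as follows. This is a forward-pass illustration under stated assumptions, not the authors' PyTorch implementation: the function names are hypothetical, the raw scores are treated directly as logits (Gumbel noise is added to them before dividing by the temperature), and the centroids are random rather than learned.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Row-wise relaxed categorical sample from Gumbel-perturbed logits."""
    g = rng.gumbel(size=logits.shape)
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def hierarchical_stochastic_attention(q, k, v, C, tau1, tau2, rng):
    """One head. q, k, v: (l, d_h); C: (d_h, c)."""
    a_c = gumbel_softmax(k @ C, tau1, rng)        # Eq. (14): (l, c)
    k_hat = a_c @ C.T                             # Eq. (15): (l, d_h)
    a_v = gumbel_softmax(q @ k_hat.T, tau2, rng)  # Eq. (16): (l, l)
    h = a_v @ v                                   # Eq. (17): (l, d_h)
    return h, a_c, a_v


# Toy usage with random inputs and centroids (illustrative only).
rng = np.random.default_rng(0)
l, d_h, c = 6, 8, 4
q, k, v = (rng.normal(size=(l, d_h)) for _ in range(3))
C = rng.normal(size=(d_h, c))
h1, a_c, a_v = hierarchical_stochastic_attention(q, k, v, C, 1.0, 20.0, rng)
h2, _, _ = hierarchical_stochastic_attention(q, k, v, C, 1.0, 20.0, rng)
```

Two calls on the same inputs give different outputs `h1` and `h2`; the spread across such repeated calls is what the uncertainty estimate is built from.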
Why perform clustering on key heads? Equation
(14) performs clustering on the key attention heads and outputs
an attention distribution, and Equation (15) forms
a new head based on this attention distribution and the learned
centroids. The goal is to make the original key heads
stochastic, allowing the attention distribution to have randomness
for uncertainty estimation. This goal can also be accomplished
by applying Equations (14) and (15) to the query while
keeping the key unchanged. In that case, $\hat{a}_c$ can still be sampled
stochastically based on the query and centroids.
Stochastic attention approximation. Equations (14)
and (15) group the key heads into a fixed number of centroids,
and the keys are reweighted by the mixture of centroids. As
in (Vyas, Katharopoulos, and Fleuret 2020), we can analyze
the attention approximation error, and derive that the
key head attention difference is bounded.
Proposition 3.1 Given two keys $k_i$ and $k_j$ such that
$\|k_i - k_j\|_2 \le \varepsilon$, the stochastic key attention difference is
bounded:
$$\|\mathcal{G}(\tau^{-1} k_i C) - \mathcal{G}(\tau^{-1} k_j C)\|_2 \le \tau^{-1} \varepsilon \|C\|_2,$$
where $\mathcal{G}$ is the Gumbel-Softmax function, and $\|C\|_2$ is the
spectral norm of the centroids. $\varepsilon$ and $\tau$ are constants.
Proof 3.1 As with the Softmax function, which has a Lipschitz
constant less than 1 (Gao and Pavel 2017), we have
the following derivation:
$$\|\mathcal{G}(\tau^{-1} k_i C) - \mathcal{G}(\tau^{-1} k_j C)\|_2 \le \|\tau^{-1} k_i C - \tau^{-1} k_j C\|_2 \le \tau^{-1} \varepsilon \|C\|_2 \tag{18}$$
Proposition 3.1 shows that the i-th key assigned to the j-th
centroid can be bounded by its distance from the j-th centroid.
Keys that are close in Euclidean space have similar attention
distributions over centroids.
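The bound in Proposition 3.1 can be checked empirically on random inputs. The sketch below makes one explicit assumption: both keys share the same Gumbel noise vector `g`, so that $\mathcal{G}$ reduces to a Softmax with a fixed offset. The helper name and the toy dimensions are illustrative.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax for a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def attention_gap_and_bound(k_i, k_j, C, tau, g):
    """Left- and right-hand sides of Proposition 3.1 for shared noise g."""
    a_i = softmax(k_i @ C / tau + g)
    a_j = softmax(k_j @ C / tau + g)
    gap = np.linalg.norm(a_i - a_j)
    eps = np.linalg.norm(k_i - k_j)
    bound = eps * np.linalg.norm(C, ord=2) / tau  # ord=2: spectral norm
    return gap, bound


rng = np.random.default_rng(0)
d_h, c, tau = 8, 4, 1.0
C = rng.normal(size=(d_h, c))
results = []
for _ in range(100):
    k_i = rng.normal(size=d_h)
    k_j = k_i + 0.1 * rng.normal(size=d_h)  # a nearby key
    g = rng.gumbel(size=c)
    results.append(attention_gap_and_bound(k_i, k_j, C, tau, g))
```

In every trial the attention gap stays below the bound, as expected from the Lipschitz argument in the proof.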
4 EXPERIMENTAL SETUPS
We design experiments to achieve the following objectives:
• To evaluate the predictive performance of models on in-domain
datasets. High predictive scores and low uncertainty
scores are desired.
• To compare the model generalization from in-domain to
out-of-domain datasets. High scores are desired.
• To estimate the uncertainty of the models on out-of-domain
datasets. High uncertainty scores are desired.
• To measure the model capability in learning the predictive
performance and uncertainty estimation trade-off.
Datasets
We use the IMDB dataset¹ (Maas et al. 2011) for the sentiment
analysis task. The standard IMDB has 25,000/25,000 reviews
for training and test, covering 72,062 unique words.
For hyperparameter selection, we take 10% of the training data
as a validation set, leading to 22,500/2,500/25,000 data samples
for training, validation, and testing. In addition, we use the
customer review (CR) dataset (Hendrycks and Gimpel 2017),
which has 500 samples, to evaluate the proposed model in
OOD settings. We conduct the second experiment on the linguistic
acceptability task with the CoLA dataset² (Warstadt,
Singh, and Bowman 2019). It consists of 8,551 training and
527 validation in-domain samples. As the labels of the test set
are not publicly available, we randomly split the 9,078 in-domain
samples into train/valid/test sets with a 7:1:2 ratio. Additionally,
we use the provided 516 out-of-domain samples for uncertainty
estimation.
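A random 7:1:2 split of the 9,078 in-domain CoLA samples can be sketched as below. The seed, the helper name `split_indices`, and the rounding rule (floor for train and validation, remainder to test) are assumptions; the paper does not state the exact split sizes.

```python
import random


def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/valid/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_valid = n * ratios[1] // total
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]  # the remainder goes to the test part
    return train, valid, test


train, valid, test = split_indices(9078)
```

Under this rounding rule the three parts contain 6,354, 907 and 1,817 samples.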
Compared Methods
We compare the following methods in our experimental
setup:
• TRANS (Vaswani et al. 2017): The vanilla transformer
with deterministic self-attention.
• MC-DROPOUT (Gal and Ghahramani 2016): Using
dropout (Srivastava et al. 2014) as a regularizer to measure
the prediction uncertainty.
• ENSEMBLE (Lakshminarayanan, Pritzel, and Blundell
2017): Average over multiple independently trained
transformers.
• STO-TRANS: The proposed method in which the attention
distribution over values is stochastic.
• H-STO-TRANS: The proposed method that uses hierarchical
stochastic self-attention, i.e., the stochastic attention
from key heads to a learnable set of centroids and the
stochastic attention to values, respectively.
¹ https://ai.stanford.edu/~amaas/data/sentiment/
² https://nyu-mll.github.io/CoLA/
Implementation details
We implement models in PyTorch (Paszke et al. 2019). The
models are trained with Adam (Kingma and Ba 2014) as the
optimization algorithm. For each trained model, we sample
10 predictions (i.e., run inference 10 times) and report the mean and
variance (or standard deviation) of the results. The
uncertainty is quantified by the variance (or standard
deviation). For sentiment analysis, we use 1 layer with
8 heads, both the embedding size and the hidden dimen-
sion size are 128. We train the model with learning rate
of 1e-3, batch size of 128, and dropout rate of 0.5/0.1. We
evaluate models at each epoch, and the models are trained
with maximum 50 epochs. We report accuracy as the eval-
uation metric. For linguistic acceptability, we use 8 layers
and 8 heads, the embedding size is 128 and the hidden di-
mension is 512. We train the model with learning rate of
5e-5, batch size of 32 and dropout rate of 0.1. We train the
models with maximum 2000 epochs and evaluate the mod-
els at every 50 epochs. We use Matthews correlation coeffi-
cient (MCC) (Matthews 1975) as the evaluation metric. The
model selection is performed based on validation dataset ac-
cording to predictive performance.
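For reference, the Matthews correlation coefficient used for CoLA can be computed from the binary confusion counts as in the sketch below. The helper name `mcc` and the convention of returning 0.0 when the denominator vanishes are choices made for this example; the tables report MCC multiplied by 100.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


perfect = mcc([1, 1, 0, 0], [1, 1, 0, 0])
inverted = mcc([1, 1, 0, 0], [0, 0, 1, 1])
chance = mcc([1, 1, 0, 0], [1, 0, 1, 0])
```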
5 EXPERIMENTAL RESULTS
Results on Sentiment Analysis
Table 1 presents the predictive performance and uncer-
tainty estimation on IMDB (in-domain, ID) and CR (out-
of-domain, OOD) dataset, evaluated by accuracy.
First, STO-TRANS and H-STO-TRANS are able to provide
uncertainty information, as well as maintain and even
slightly outperform the predictive performance of TRANS.
Specifically, STO-TRANS (τ = 40) and H-STO-TRANS (τ1 =
1, τ2 = 30) outperform TRANS (η = 0.1) by 0.42% and
0.66% on the ID dataset. In addition, they allow us to measure
the uncertainty via predictive variances, because they inject
randomness directly into the self-attention. In contrast, TRANS
has no access to uncertainty information due to its deterministic
nature.
Second, STO-TRANS struggles to learn a good trade-off
between ID predictive performance and OOD uncertainty
estimation performance. With a small temperature τ =
1, STO-TRANS gives good uncertainty information, but we
observe that the ID predictive performance drops. When
τ approaches √(d/h) (the original scaling factor in the
vanilla transformer), STO-TRANS achieves better performance
on the ID dataset, but lower performance on the OOD
dataset. We conjecture that the randomness in STO-TRANS
is solely based on the attention distribution over values and
is not enough for learning the trade-off.
Third, H-STO-TRANS achieves better accuracy-
uncertainty trade-off compared with STO-TRANS. For
Figure 3: The experiments with hyperparameter τ. Left: STO-TRANS with different τ. The randomness is solely based on
the sampling on attention distribution over values. While uncertainty information is captured, STO-TRANS has difficulties in
learning the trade-off between in-domain and out-of-domain performance. Middle: The hyperparameter tuning of τ1 and τ2 in
H-STO-TRANS. τ1 controls the concentration on centroids and τ2 controls the concentration on values.
Models                           ID (%)           OOD (%)          ∇ID (%)  ∇OOD (%)
TRANS (η = 0.1)                  87.00            65.00            /        /
TRANS (η = 0.5)                  87.51            63.40            0.51     1.60
MC-DROPOUT (η = 0.5)             86.06 ± 0.087    63.38 ± 1.738    0.94     1.62
MC-DROPOUT (η = 0.1)             87.01 ± 0.075    63.38 ± 0.761    0.10     1.62
ENSEMBLE                         86.89 ± 0.230    64.20 ± 1.585    0.11     0.80
STO-TRANS (τ = 1)                82.62 ± 0.092    67.92 ± 0.634    4.38     2.92
STO-TRANS (τ = 40)               87.42 ± 0.022    63.78 ± 0.289    0.42     1.22
H-STO-TRANS (τ1 = 1, τ2 = 20)    87.63 ± 0.017    67.14 ± 0.400    0.63     2.14
H-STO-TRANS (τ1 = 1, τ2 = 30)    87.66 ± 0.022    66.72 ± 0.271    0.66     1.72
Table 1: The predictive performance and uncertainty estimation of models on IMDB (ID) and CR (OOD) dataset. The uncertainty
estimation is performed by running forward pass inference by 10 runs, then the uncertainty is quantified by standard
deviation across runs. For ensemble, the results are averaged on 10 models that are independently trained with random seeds.
Dropout is used in the inference of MC-DROPOUT and η is dropout rate. In the rest of methods, dropout is not used in inference.
The ∇ID (%) and ∇OOD (%) present the predictive performance difference to TRANS (η = 0.1).
instance, with τ1 = 1, τ2 = 20, H-STO-TRANS achieves
87.63% and 67.14%, which outperform the corresponding
numbers of STO-TRANS on both the ID and OOD datasets. It
also outperforms MC-DROPOUT and ENSEMBLE: specifically,
H-STO-TRANS outperforms them by 0.62%-1.6% and 2.52%-3.76%
on the ID and OOD datasets, respectively. On the OOD dataset,
while MC-DROPOUT and ENSEMBLE exhibit higher uncertainty
(measured by standard deviation) across runs,
their accuracy is lower than that of TRANS (η = 0.1),
STO-TRANS (τ = 1) and H-STO-TRANS. We attribute this to a
better way of learning two types of randomness: one from
sampling over a set of learnable centroids and the other
from sampling attention over values.
Figure 3 reports the hyperparameter tuning of τ1 and τ2.
The goal is to find a reasonable combination to achieve high
predictive performance on both ID and OOD datasets. To
simplify the tuning, we fix τ1 = 1 and vary
τ2, and vice versa. As we can see, the
combination of a small τ1 and a large τ2 performs better than
the other way around. We think this is because τ2 acts at the
later stage and has a bigger effect on the predictive performance.
However, removing τ1 leads back to Figure 3 (Left),
where the accuracy-uncertainty trade-off is not well learned by
STO-TRANS.
Results on Linguistic Acceptability
Table 2 shows the performance of compared models on both
in-domain (ID) and out-of-domain (OOD) sets of CoLA
dataset, evaluated by MCC.
Models                  ID (%)          OOD (%)         ∇ID (%)  ∇OOD (%)
TRANS (η = 0.1)         20.09           16.46           /        /
MC-DROPOUT (η = 0.1)    19.91 ± 0.40    16.70 ± 2.21    0.18     0.24
MC-DROPOUT (η = 0.05)   20.03 ± 0.30    17.11 ± 1.21    0.06     0.65
ENSEMBLE                21.20 ± 2.59    16.73 ± 4.92    1.11     0.27
STO-TRANS               23.27 ± 0.75    15.25 ± 4.65    3.18     1.21
H-STO-TRANS             20.52 ± 0.76    16.49 ± 4.08    0.43     0.03
Table 2: The performance of compared models on CoLA
dataset. We set all temperature values τ1 = 1 and τ2 = 1.
The ∇ID (%) and ∇OOD (%) present the predictive performance
difference to TRANS (η = 0.1), respectively.
First, STO-TRANS and H-STO-TRANS obtain comparable
performance and also provide uncertainty information,
compared with TRANS. To be specific, STO-TRANS and
H-STO-TRANS improve MCC by 3.18% and 0.43% on the ID
dataset compared with the deterministic TRANS, respectively.
Second, STO-TRANS achieves the best performance on the
ID dataset but the worst performance on the OOD dataset.
Although STO-TRANS outperforms TRANS, the best MC-DROPOUT,
and ENSEMBLE by 3.18%, 3.24%, and 2.07% of MCC on the
ID dataset, its performance drops by 1.21%, 1.86%, and 1.48%,
correspondingly, on the OOD dataset. This further supports our
conjecture that randomness introduced only into the attention
distribution over values is insufficient for learning
the trade-off between ID and OOD data.
Examples (Labels)                                                  Prob. Corr.      Corr./Total
no man has ever beaten the centaur. (1)                            0.75 ± 0.001     10/10
nora sent the book to london (1)                                   0.65 ± 0.007     10/10
sally suspected joe, but he did n’t holly. (1)                     0.60 ± 0.008     8/10
kim is eager to recommend. (0)                                     0.41 ± 0.011     3/10
he analysis her was flawed (0)                                     0.24 ± 0.003     0/10
-----------------------------------------------------------------------------------------------
sandy had read how many papers ? ! (1)                             0.67 ± 0.010     10/10
which book did each author recommend ? (1)                         0.58 ± 0.010     7/10
she talked to harry , but i do n’t know who else . (1)             0.52 ± 0.013     4/10
john is tall on several occasions . (0)                            0.42 ± 0.005     1/10
they noticed the painting , but i do n’t know for how long . (0)   0.28 ± 0.003     0/10
Table 3: Illustration of predictions with H-STO-TRANS. The
predictions for each ID (top) and OOD (bottom) samples are
measured by the probability of being correct of each prediction
and the number of correct predictions.
Third, H-STO-TRANS learns a better trade-off between
prediction and uncertainty. Specifically, its MCC improves
over TRANS by 0.43% and 0.03% on the ID and OOD datasets,
respectively. On the ID dataset, H-STO-TRANS is 0.49% better
than MC-DROPOUT (η = 0.05) and 0.68% worse than ENSEMBLE.
However, ENSEMBLE shows high uncertainty on the ID dataset and
MC-DROPOUT (η = 0.05) shows low uncertainty on the OOD
dataset, neither of which is desired. Therefore, H-STO-TRANS
strikes the better balance across the objectives, which for
this task are high MCC with low variance on the ID dataset and
high MCC with high variance on the OOD dataset.
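The differences quoted above follow directly from the mean MCC
values in Table 2. The short Python sketch below recomputes
them. The dictionary layout and the helper name `gap_to_trans`
are illustrative assumptions; the numbers are copied from the
table.

```python
# Mean MCC (%) copied from Table 2: model -> (ID, OOD).
table2 = {
    "TRANS (eta=0.1)": (20.09, 16.46),
    "MC-DROPOUT (eta=0.1)": (19.91, 16.70),
    "MC-DROPOUT (eta=0.05)": (20.03, 17.11),
    "ENSEMBLE": (21.20, 16.73),
    "STO-TRANS": (23.27, 15.25),
    "H-STO-TRANS": (20.52, 16.49),
}
baseline_id, baseline_ood = table2["TRANS (eta=0.1)"]


def gap_to_trans(name):
    """Absolute ID and OOD differences to the deterministic baseline."""
    id_mcc, ood_mcc = table2[name]
    return (round(abs(id_mcc - baseline_id), 2),
            round(abs(ood_mcc - baseline_ood), 2))


for name in table2:
    if name != "TRANS (eta=0.1)":
        gap_id, gap_ood = gap_to_trans(name)
        print(f"{name:24s} ID gap {gap_id:.2f}  OOD gap {gap_ood:.2f}")
```

Because the gaps are absolute values, they do not show the
direction of a change: STO-TRANS is 1.21% below TRANS on the OOD
set, whereas H-STO-TRANS is 0.03% above it.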
Table 3 shows predictions of H-STO-TRANS on some test
samples. We make two observations. (1) In general, ID
predictions have lower variance in the probability of being
correct. Among “10/10” cases (10 correct predictions out of
10 in total), the ID examples have higher probability scores
than the OOD ones. We also find far fewer “10/10” cases in the
OOD dataset than in the ID dataset. (2) On the ID dataset, we
see more “10/10” (confidently correct) or “0/10” (confidently
incorrect) cases, with either high or low probability scores.
As expected, in both cases the variance is relatively low
compared with cases where the probability is around 0.5.
Deterministic models do not give access to this kind of
information, which indicates how confident the transformer
models are in their predictions.
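Statistics of the kind reported in Table 3 can be obtained by
running a stochastic model several times on the same input and
aggregating the outputs. The following Python sketch illustrates
one way to do this for binary classification. The function name,
the 0.5 decision threshold, the use of the population variance,
and the example probabilities are all assumptions for
illustration, not our evaluation code.

```python
import statistics


def summarize_runs(pos_probs, label, threshold=0.5):
    """Summarize repeated stochastic forward passes for one example.

    pos_probs: predicted probability of class 1 in each pass.
    label: gold label, 0 or 1.
    """
    # Probability assigned to the gold class in each pass.
    prob_correct = [p if label == 1 else 1.0 - p for p in pos_probs]
    # A pass counts as correct when the gold class is the more likely one.
    n_correct = sum(1 for p in prob_correct if p > threshold)
    return {
        "mean": statistics.mean(prob_correct),
        "variance": statistics.pvariance(prob_correct),
        "corr_total": f"{n_correct}/{len(pos_probs)}",
    }


# Hypothetical outputs of 10 passes for two examples with gold label 1.
stable = summarize_runs(
    [0.74, 0.76, 0.75, 0.73, 0.77, 0.75, 0.74, 0.76, 0.75, 0.75], label=1)
unstable = summarize_runs(
    [0.45, 0.58, 0.49, 0.61, 0.47, 0.55, 0.44, 0.52, 0.48, 0.41], label=1)
print(stable)
print(unstable)
```

In this made-up example, the first input is predicted correctly
in every pass with little spread, while the second has a mean
close to 0.5, a larger variance, and a mixed correct count.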
6 RELATED WORK
Bayesian neural networks (Blundell et al. 2015) inject
stochasticity by sampling the network parameters from a
Gaussian prior. The posterior distribution of the target can
then be estimated over multiple sampling runs. However, the
Bayesian approach doubles the number of network parameters:
instead of learning a single point estimate for each
parameter, it learns a weight distribution that is assumed to
be Gaussian. Additionally, it often requires intensive tuning
of the Gaussian mean and variance to achieve stable learning
curves and good predictive performance. MC dropout (Gal and
Ghahramani 2016) approximates the Bayesian approach by
sampling dropout masks from a Bernoulli distribution. However,
MC dropout has been shown to give overconfident uncertainty
estimates (Foong et al. 2019). Alternatively, deep ensembles
(Lakshminarayanan, Pritzel, and Blundell 2017) estimate
predictive uncertainty by combining predictions from different
models trained with different random seeds. This, however,
significantly increases the computational overhead of training
and inference. Several related methods have also been proposed
recently. The sequential MC transformer (Martin et al. 2020)
models uncertainty by casting self-attention parameters as
unobserved latent states that evolve randomly through time.
He et al. (2020) combined mix-up, self-ensembling and dropout
to obtain more accurate uncertainty scores for text
classification. Shelmanov et al. (2021) proposed incorporating
determinantal point processes (DPP) into MC dropout to
quantify the uncertainty of transformers. In contrast to these
approaches, we inject stochasticity into the vanilla
transformer with the Gumbel-Softmax trick. As shown in the
experiment section, the hierarchical stochastic self-attention
component can effectively capture model uncertainty and learn
a good trade-off between in-domain predictive performance and
out-of-domain uncertainty estimation.
7 DISCUSSION
While many extensions of transformers have recently been
proposed, most transformer variants are still deterministic.
Our goal in this work is to make transformers stochastic so
that they can estimate uncertainty while retaining the
original predictive performance. This requires a special
design to achieve both goals without adding major
computational overhead to model training and inference, as
Ensembles and Bayesian Neural Networks (BNN) do. The added
complexity of our method over its deterministic version is
modest: it requires only an additional matrix
C ∈ ℝ^{d_h × c}. This is more efficient than Ensemble and
BNN, which need N times more weights (N ≥ 2 for Ensemble and
N = 2 for BNN).
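As a rough illustration of this comparison, the Python sketch
below contrasts the relative parameter increase from one extra
d_h × c matrix with that of storing the weights N times. All
sizes are hypothetical and do not correspond to our experimental
setup. The sketch also assumes a single centroid matrix, so the
first figure would scale with the number of such matrices if
there is one per layer.

```python
def centroid_overhead(base_params, d_h, c):
    """Relative parameter increase from adding one d_h x c matrix C."""
    return (d_h * c) / base_params


def replication_overhead(n):
    """Relative parameter increase when the weights are stored n times."""
    return n - 1.0


# Hypothetical sizes, chosen only for illustration.
base_params = 1_000_000   # parameters of the deterministic transformer
d_h, c = 64, 16           # head dimension and number of centroids

print(f"extra matrix C : +{centroid_overhead(base_params, d_h, c):.4%}")
print(f"BNN (N = 2)    : +{replication_overhead(2):.0%}")
print(f"Ensemble (N=5) : +{replication_overhead(5):.0%}")
```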
8 CONCLUSION
This work proposes a novel, simple yet effective way to
enable uncertainty estimation in transformers, as an
alternative to MC dropout and ensembles. We propose variants
of transformers based on two stochastic self-attention
mechanisms: (1) injecting stochasticity into the attention
over values; (2) forcing key heads to pay stochastic attention
to a set of learnable centroids. Our experimental results show
that the proposed approach learns good trade-offs between
in-domain predictive performance and out-of-domain uncertainty
estimation performance on two NLP benchmark tasks, and
achieves the best such trade-off among the compared methods.
References
Ahmed, K.; and Torresani, L. 2019. Star-caps: Capsule net-
works with straight-through attentive routing. NeurIPS’19,
32: 9101–9110.
Ashukha, A.; Lyzhov, A.; Molchanov, D.; and Vetrov, D.
2019. Pitfalls of in-domain uncertainty estimation and en-
sembling in deep learning. In ICLR’19.
Baker, S. R.; Bloom, N.; Davis, S. J.; and Terry, S. J. 2020.
Covid-induced economic uncertainty. Technical report, Na-
tional Bureau of Economic Research.
Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra,
D. 2015. Weight uncertainty in neural network. In ICML’15,
1613–1622. PMLR.
Der Kiureghian, A.; and Ditlevsen, O. 2009. Aleatory or
epistemic? Does it matter? Structural safety, 31(2): 105–
112.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of deep bidirectional transformers for
language understanding. In NAACL-HLT’19, 4171–4186.
Foong, A. Y.; Li, Y.; Hernández-Lobato, J. M.; and Turner,
R. E. 2019. ‘In-between’ uncertainty in Bayesian neural
networks. In ICML’19 Workshop.
Gal, Y.; and Ghahramani, Z. 2016. Dropout as a bayesian ap-
proximation: Representing model uncertainty in deep learn-
ing. In ICML’16, 1050–1059. PMLR.
Gao, B.; and Pavel, L. 2017. On the properties of the softmax
function with application in game theory and reinforcement
learning. arXiv preprint arXiv:1704.00805.
Gawlikowski, J.; Tassi, C. R. N.; Ali, M.; Lee, J.; Humt,
M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.;
et al. 2021. A survey of uncertainty in deep neural networks.
arXiv preprint arXiv:2107.03342.
Ghoshal, B.; and Tucker, A. 2020. Estimating uncer-
tainty and interpretability in deep learning for coronavirus
(COVID-19) detection. arXiv preprint arXiv:2003.10769.
Gillioz, A.; Casas, J.; Mugellini, E.; and Abou Khaled, O.
2020. Overview of the Transformer-based models for NLP
tasks. In FedCSIS’19, 179–183. IEEE.
Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural tur-
ing machines. arXiv preprint arXiv:1410.5401.
Gumbel, E. J. 1954. Statistical theory of extreme values and
some practical applications. A series of lectures. Number
33. US Govt. Print. Office.
Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; and Wang,
Y. 2021. Transformer in transformer. arXiv preprint
arXiv:2103.00112.
He, J.; Zhang, X.; Lei, S.; Chen, Z.; Chen, F.; Alhamadani,
A.; Xiao, B.; and Lu, C. 2020. Towards More Accurate Un-
certainty Estimation In Text Classification. In EMNLP’20,
8362–8372.
Hendrycks, D.; and Gimpel, K. 2017. A baseline for detect-
ing misclassified and out-of-distribution examples in neural
networks. In ICLR’17.
Hoel, C.-J.; Wolff, K.; and Laine, L. 2020. Tactical decision-
making in autonomous driving by reinforcement learning
with uncertainty estimation. In 2020 IEEE Intelligent Ve-
hicles Symposium (IV), 1563–1569. IEEE.
Jang, E.; Gu, S.; and Poole, B. 2017. Categorical reparame-
terization with gumbel-softmax. In ICLR’17.
Kabir, H. D.; Khosravi, A.; Hosen, M. A.; and Nahavandi,
S. 2018. Neural network-based uncertainty quantification:
A survey of methodologies and applications. IEEE access,
6: 36218–36234.
Kendall, A.; and Gal, Y. 2017. What Uncertainties Do
We Need in Bayesian Deep Learning for Computer Vision?
NeurIPS’17, 30: 5574–5584.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P.; Salimans, T.; and Welling, M. 2015. Vari-
ational dropout and the local reparameterization trick. In
NeurIPS’15, volume 28, 2575–2583.
Kingma, D. P.; and Welling, M. 2014. Auto-encoding varia-
tional bayes. In ICLR’14.
Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017.
Simple and scalable predictive uncertainty estimation using
deep ensembles. In NeurIPS’17.
Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh,
Y. W. 2019. Set transformer: A framework for attention-
based permutation-invariant neural networks. In Inter-
national Conference on Machine Learning, 3744–3753.
PMLR.
Lin, G.; Engel, D. W.; and Eslinger, P. W. 2012. Survey and
evaluate uncertainty quantification methodologies. Techni-
cal report, Pacific Northwest National Lab.(PNNL), Rich-
land, WA (United States).
Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.;
and Potts, C. 2011. Learning word vectors for sentiment
analysis. In NAACL-HLT’11, 142–150. Portland, Oregon,
USA: Association for Computational Linguistics.
Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The con-
crete distribution: A continuous relaxation of discrete ran-
dom variables. In ICLR’17.
Martin, A.; Ollion, C.; Strub, F.; Corff, S. L.; and Pietquin,
O. 2020. The Monte Carlo Transformer: a stochastic self-
attention model for sequence prediction. arXiv preprint
arXiv:2007.08620.
Matthews, B. W. 1975. Comparison of the predicted and ob-
served secondary structure of T4 phage lysozyme. Biochim-
ica et Biophysica Acta (BBA)-Protein Structure, 405(2):
442–451.
Oh, G.; and Hong, Y. S. 2021. Managing market risk caused
by customer preference uncertainty in product family design
with launch flexibility: Product option strategy. Computers
& industrial engineering, 151: 106975.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.;
Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga,
L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison,
M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai,
J.; and Chintala, S. 2019. PyTorch: An imperative style,
high-performance deep learning library. In Wallach, H.;
Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.;
and Garnett, R., eds., NeurIPS’19, 8024–8035.
Ren, P.; Chen, Z.; Ren, Z.; Kanoulas, E.; Monz, C.; and
De Rijke, M. 2021. Conversations with search engines:
SERP-based conversational response generation. TOIS’20,
39(4): 1–29.
Riedmaier, S.; Danquah, B.; Schick, B.; and Diermeyer, F.
2021. Unified framework and survey for model verification,
validation and uncertainty quantification. Archives of Com-
putational Methods in Engineering, 28(4): 2655–2688.
Shelmanov, A.; Tsymbalov, E.; Puzyrev, D.; Fedyanin, K.;
Panchenko, A.; and Panov, M. 2021. How certain is your
transformer? In EACL’21, 1833–1840.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: A simple way to prevent
neural networks from overfitting. Journal of Machine Learn-
ing Research, 15(56): 1929–1958.
Tetko, I. V.; Karpov, P.; Van Deursen, R.; and Godin, G.
2020. State-of-the-art augmented NLP transformer models
for direct and single-step retrosynthesis. Nature communi-
cations, 11(1): 1–11.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
tention is all you need. In NeurIPS’17, 5998–6008.
Vyas, A.; Katharopoulos, A.; and Fleuret, F. 2020. Fast
transformers with clustered attention. In NeurIPS’20.
Wang, B.; Shang, L.; Lioma, C.; Jiang, X.; Yang, H.; Liu,
Q.; and Simonsen, J. G. 2020. On position embeddings in
bert. In ICLR’20.
Wang, C.; Lawrence, C.; and Niepert, M. 2021. Uncertainty
Estimation and Calibration with Finite-State Probabilistic
RNNs. In ICLR’21.
Wang, C.; and Niepert, M. 2019. State-regularized recurrent
neural networks. In ICML’19, 6596–6606. PMLR.
Wang, Q.; and Van Hoof, H. 2020. Doubly Stochastic Vari-
ational Inference for Neural Processes with Hierarchical La-
tent Variables. In ICML’20, 10018–10028. PMLR.
Warstadt, A.; Singh, A.; and Bowman, S. R. 2019. Neural
network acceptability judgments. Transactions of the Asso-
ciation for Computational Linguistics, 7: 625–641.