JMLR: Workshop and Conference Proceedings 113 AutoML 2018 - ICML Workshop
Automatic Type Inference
with a Nested Latent Variable Model
Neil Dhir neil.dhir@mindfoundry.ai
Davide Zilli davide.zilli@mindfoundry.ai
Tomasz Rudny tomasz.rudny@mindfoundry.ai
Alessandra Tosi alessandra.tosi@mindfoundry.ai
Mind Foundry Ltd, Ewert House, Ewert Place, Oxford, OX2 7DD, United Kingdom
Abstract
We present the first steps towards a type inferential general latent feature model: the
nested latent feature model (NLFM). The NLFM combines automatic type inference and
latent feature modelling for heterogeneous data. By combining inference over type, as well
as over latent features, during runtime, we condition each latent parameter on all others,
consequently we are able to measure the likelihood of an inferred column type, under the
current setting of the latent feature (and vice versa). In so doing we receive confidence
scores on a) our inferred column set and b) our latent features under the inferred column
set – all within one model. In addition, we seek to make inference cheaper by efficient
model design.
Keywords: Latent feature modelling, type inference, automatic data ingestion
1. Introduction
Automatic machine learning (AutoML) is a concept that has grown in prominence since the
inception of powerful machine learning methods. At its core AutoML seeks to progressively
augment human machine learning experts, to enable them to focus on data analysis, whilst
leaving more mundane tasks, such as data ingestion, to an automatic transformation and
identification system.
Figure 1: Different data types (discrete vs. continuous).
As the big data drive shows little sign of abating, there has recently been a growing body of work which attempts to automate and accelerate different stages (Valera and Ghahramani, 2017) of the pre-processing pipeline. Hellerstein (2008) presents a qualitative monograph on the subject of data cleaning, noting references and resources for automation. Similarly, Kandel et al. (2011) discuss research directions in data wrangling (diagnosing data quality issues and manipulating data into a usable form), whereas Hernández-Lobato et al. (2014) touch upon the subject of type inference, i.e. estimating whether the object attributes belong to a certain likelihood model family (see Figure 1). Whilst it is trivial to construct simple heuristics to distinguish between major types of data (e.g. continuous vs. discrete), it is much harder to reason about sub-types. For example, presented with a univariate data-sample:
{10,12,15,20,25}, it is difficult to determine if this is an ordinal attribute (IAAF recognised
common distances for road races) or a numeric categorical attribute (Royal Navy pennant
numbers). The former has order, the latter does not.
While a few select methods exist for heterogeneous modelling and type inference, to our knowledge there exists no complete model which can do both. We present the nested latent feature model (NLFM), an unsupervised and nonparametric method that allows the column type to be learned conditioned on the inferred number of latent features, and vice versa.
2. Type inferential general latent feature model
Posit that our observations (data) are stored in a design matrix $\mathbf{X}$ of size $N \times D$. We denote object $n$ on the rows as $\mathbf{x}_n \triangleq [x_n^1, x_n^2, \ldots, x_n^d, \ldots, x_n^D]$. In probabilistic low-rank matrix factorisation (Valera and Ghahramani, 2014) we assume that $\mathbf{X}$ can be approximated by the product $\mathbf{X} \approx \mathbf{Z}\mathbf{B} + \boldsymbol{\epsilon}$ between a binary feature matrix $\mathbf{Z} \in \{0,1\}^{N \times K}$ (which can also be e.g. Gaussian distributed) and a weight matrix $\mathbf{B} \in \mathbb{R}^{K \times D}$. Here $\boldsymbol{\epsilon}$ is white Gaussian noise, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, independent of $\mathbf{Z}$ and $\mathbf{B}$ and uncorrelated across observations.
Database modelling typically assumes that the column likelihood models (types) $m_d\;\forall d \in \{1, \ldots, D\}$ are known a priori. Valera and Ghahramani (2017) proposed a Bayesian method which accurately discovers statistical data types by imposing a likelihood model prior, where the possible set of types is a priori unknown, so that the type set becomes a set of sets $\mathcal{L} = \{m \in \mathcal{L}_d \mid d = 1, \ldots, D\}$, where $\mathcal{L}_d \triangleq \{m_j \mid j = 1, \ldots, J\}$ specifies a set of potential likelihood models, of size $|\mathcal{L}_d| = J$, for column $d$. Valera and Ghahramani (2017)'s method (from here on referred to as the general latent type model (GLTM)) concerns finding likelihood weights $\mathbf{w}^d \triangleq [w_1^d, \ldots, w_J^d]$ for each $d$, which capture the probability of type $m_j$ in attribute $\mathbf{x}^d \in \mathbf{X}$. We extend that model by combining it with the general latent feature model (GLFM) of Valera et al. (2017).
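To make the notation concrete, the following is a minimal sketch (not the authors' code) of the low-rank construction $\mathbf{X} \approx \mathbf{Z}\mathbf{B} + \boldsymbol{\epsilon}$ with a binary feature matrix; the dimensions and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 5, 3                         # objects, attributes, latent features (illustrative)

Z = rng.binomial(1, 0.3, size=(N, K))       # binary feature matrix Z in {0,1}^{N x K}
B = rng.normal(0.0, 1.0, size=(K, D))       # real-valued weight matrix B in R^{K x D}
noise = rng.normal(0.0, 0.1, size=(N, D))   # white Gaussian noise, uncorrelated across rows

X = Z @ B + noise                           # low-rank approximation of the design matrix
print(X.shape)                              # (100, 5)
```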
The model described below is capable of complete¹ automatic heterogeneous latent feature modelling by virtue of automatic type inference. This is achieved, broadly, by wrapping a type inference mechanism (we refer to this as the 'inner' model) with latent feature modelling (the 'outer' model). By interleaving nonparametric inference over the number of latent features $K$, evaluating the inner model at this $K$, and then updating the posterior estimate of the types, we propose a form of global optimisation over $K$ and the data types. Under Bayes' law we can factorise the relevant likelihoods into two main components:
we can factorise the relevant likelihoods into two main components:
p(Z,B,V,A,L|X)p(X|Z,B,L)p(Z)p(B)
| {z }
§2.1
·p(L|V,A,Z)p(V)p(A)
| {z }
§2.2
·p(L) (1)
The first factor describes a binary latent feature model, conditioned on heterogeneous data types, discussed in Section 2.1. The second factor concerns posterior estimates of data types conditioned on binary latent features, discussed in Section 2.2. As standalone models, the GLFM and the GLTM each require a piece of information ($\mathcal{L}$ and $K$ respectively) that only the other possesses, and this information is not shared between them. This explicit sharing of information is achieved in the NLFM (see Figure 2(c)).
1. By 'complete' we mean: data ingestion, latent variable modelling and type prediction.
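Equation (1) suggests an alternating scheme in which the outer (GLFM-style) and inner (GLTM-style) models exchange $K$ and $\mathcal{L}$. The sketch below captures only this control flow; the two sampler functions are hypothetical stubs standing in for the actual GLFM and GLTM samplers, not code from either model.

```python
import numpy as np

def sample_outer_glfm(X, L, n_iters, rng):
    """Placeholder outer sampler: would run nonparametric inference over Z (and hence K)
    conditioned on the current type estimates L. Here it just returns a random Z."""
    K = rng.integers(1, 6)
    return rng.binomial(1, 0.3, size=(X.shape[0], K)), K

def sample_inner_gltm(X, Z, n_iters, rng):
    """Placeholder inner sampler: would update the likelihood weights / types per column
    conditioned on Z. Here it just returns uniform weights over three candidate types."""
    return [{"categorical": 1 / 3, "ordinal": 1 / 3, "count": 1 / 3} for _ in range(X.shape[1])]

def nested_inference(X, n_loops, n_outer, n_inner, rng):
    L = None                                          # types unknown at the start
    for _ in range(n_loops):
        Z, K = sample_outer_glfm(X, L, n_outer, rng)  # outer: infer Z, K given the types L
        L = sample_inner_gltm(X, Z, n_inner, rng)     # inner: infer types L given Z (K fixed)
    return Z, K, L

Z, K, L = nested_inference(np.zeros((50, 4)), n_loops=5, n_outer=10, n_inner=20,
                           rng=np.random.default_rng(0))
```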
Figure 2: Generative latent feature and type models: (a) GLTM, (b) GLFM, (c) NLFM. In Figure 2(a) the appropriate likelihood model $m \in \mathcal{L}_d$ is sought; in Figure 2(b) the set of likelihood models $\mathcal{L}$ is assumed to be known. Finally, in Figure 2(c) the NLFM performs inference (nonparametrically) over the latent features $\mathbf{Z}$ and the latent feature weights $\mathbf{w}^d$, as well as the observation types (i.e. likelihood models $m \in \mathcal{L}$). For clarity and to avoid clutter we have omitted auxiliary parameters.
2.1. Latent feature modelling
The outer model consists of a nonparametric LFM. We seek the posterior distribution of $\mathbf{Z}$ and $\mathbf{B}$, where independence is assumed a priori between $\mathbf{Z}$ and $\mathbf{B}$; the likelihood model $p(\mathbf{X}\mid\mathbf{Z},\mathbf{B})$ and the feature prior $p(\mathbf{B})$ are determined by the application (Doshi et al., 2009). Our interest is that of latent feature modelling, so the likelihood can be factorised as
$$p(\mathbf{X}\mid\mathbf{Z},\mathbf{B},\mathcal{L}) = \prod_{d=1}^{D}\prod_{n=1}^{N} p(x_n^d \mid \mathbf{z}_n, \mathbf{b}^d, m_d). \qquad (2)$$
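As a concrete reading of Equation (2), the sketch below accumulates the log-likelihood column by column, dispatching on the column's type $m_d$; the two densities used (Gaussian, and Poisson via a softplus rate) are illustrative stand-ins, not the exact likelihood models of the NLFM.

```python
import numpy as np
from scipy import stats

def log_likelihood(X, Z, B, types):
    """Sum of log p(x_n^d | z_n, b^d, m_d) over all n and d, in the spirit of Equation (2).
    types[d] names the likelihood model m_d of column d (illustrative choices only)."""
    mean = Z @ B                                     # z_n b^d for every n, d
    total = 0.0
    for d, m_d in enumerate(types):
        if m_d == "real":
            total += stats.norm.logpdf(X[:, d], loc=mean[:, d], scale=1.0).sum()
        elif m_d == "count":
            rate = np.log1p(np.exp(mean[:, d]))      # softplus keeps the Poisson rate positive
            total += stats.poisson.logpmf(X[:, d].astype(int), rate).sum()
        else:
            raise ValueError(f"unhandled type: {m_d}")
    return total

rng = np.random.default_rng(0)
Z = rng.binomial(1, 0.5, size=(50, 3))
B = rng.normal(size=(3, 2))
X = np.column_stack([rng.normal(size=50), rng.poisson(2.0, size=50)])
print(log_likelihood(X, Z, B, types=["real", "count"]))
```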
The density $p(\mathbf{Z})$ requires a flexible prior (Ghahramani and Griffiths, 2006): because $\mathbf{Z}$ is an indicator matrix which tells us if a particular feature in $\mathbf{B}$ is present in object $n$, we want to determine $K$ at runtime but also leave it unbounded. Further, when we allow $\mathbf{X}$ to include heterogeneous data types, we can capture the latent structure using the general latent feature model (GLFM) (Valera et al., 2017), shown in Figure 2(b). Under this model the type set $\mathcal{L}$ is assumed known. This is one of the drawbacks that we address by integrating automatic type inference, which allows $\mathcal{L}$ to be determined during inference time. We discuss in Section 2.2 how posterior type estimates are produced. The model in Equation (2) extends the LG-LFM in (Ghahramani and Griffiths, 2006); Valera et al. (2017) extended one of the central tenets of that model, the Indian buffet process (IBP), to account for heterogeneous data while maintaining the model complexity of conjugate models. The IBP places a prior on binary matrices where the number of columns, corresponding to latent features $K$, is potentially infinite and can be inferred from the data, and its utility rests on two foundations. First, we can learn the model complexity from the data. Second, Valera et al. (2017) argue that binary-valued latent features have been shown to provide more interpretable results in data exploration than standard real-valued latent feature models (Ruiz et al., 2012, 2014).
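For intuition, the following is a standard, minimal simulation of the IBP prior (not the sampler used in the paper): customer $n$ takes each existing dish with probability proportional to its popularity and then tries $\text{Poisson}(\alpha/n)$ new dishes, so the number of columns $K$ of $\mathbf{Z}$ is determined by the data rather than fixed in advance.

```python
import numpy as np

def sample_ibp(N, alpha, rng=None):
    """Draw a binary matrix Z ~ IBP(alpha) with N rows and a random number of columns K."""
    if rng is None:
        rng = np.random.default_rng()
    Z = np.zeros((0, 0), dtype=int)
    for n in range(1, N + 1):
        m = Z.sum(axis=0)                                   # popularity of each existing dish
        old = (rng.random(Z.shape[1]) < m / n).astype(int)  # take dish k with probability m_k / n
        k_new = rng.poisson(alpha / n)                      # number of brand-new dishes
        Z = np.pad(Z, ((0, 0), (0, k_new)))                 # add empty columns for the new dishes
        row = np.concatenate([old, np.ones(k_new, dtype=int)])
        Z = np.vstack([Z, row])
    return Z

Z = sample_ibp(N=20, alpha=2.0, rng=np.random.default_rng(1))
print(Z.shape)   # the number of latent features K varies from draw to draw
```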
Finally, because we have an estimate of $\mathcal{L}$, our posterior estimate of $\mathbf{X}$ incorporates our uncertainty about the type itself. This is important because it may provide further insight into the generative process of a particular column, e.g. regarding its ordinal (or not) nature. As we saw in the example in Section 1, this is not a trivial exercise, and
one which warrants automation for large datasets wherein data exploration is required or
desired.
2.2. Type inference
The inner part of the NLFM is also an LFM, but a parametric one – see Figure 2(a). As demonstrated by Valera and Ghahramani (2017), this type of model requires the number of latent features $K$ to be set a priori. However, type prediction is intimately linked to the size of $K$; consequently it follows that joint inference over both type and $K$ is warranted. We will demonstrate how information is passed between the outer model and the inner model, and vice versa, to elicit efficient inference over latent features and latent types. As before we seek a low-rank representation of our observations
$$p(\mathbf{X}\mid\mathbf{V},\mathbf{A},\mathbf{Z}) = \prod_{d=1}^{D}\prod_{n=1}^{N} p(x_n^d \mid \mathbf{v}_n, \mathbf{a}^d, \mathbf{z}_n), \qquad (3)$$
where we are now investigating the right-hand part of Equation (1). As before, latent features per row are captured by $\mathbf{v}_n$ and stored in $\mathbf{V}$, and feature weights by $\mathbf{a}^d$, stored in matrix $\mathbf{A}$. Herein the latent parameters that need inferring are the likelihood models $\{m_d \mid d = 1, \ldots, D\}$. But by explicitly conditioning on the binary feature matrix from the outer model, we do not need to perform inference over the model complexity, as we posit that this information is implicitly contained in $\mathbf{Z}$, on which we already have a posterior estimate from the outer model. Hence $K$ is fixed from the outset but, crucially, it is an inferred parameter, unlike in (Valera and Ghahramani, 2017).
Our interest lies in capturing the generative distribution of each attribute $\mathbf{x}^d \in \mathbf{X}$. We can entertain this problem using the main idea from (Valera and Ghahramani, 2017), where we assume that the likelihood model of $\mathbf{x}^d \in \mathbf{X}$ is a mixture of likelihood models
$$p_m(\mathbf{x}^d \mid \mathbf{V}, \{\mathbf{a}^d_m\}_{m\in\mathcal{L}_d}, \mathbf{Z}) = \sum_{m\in\mathcal{L}_d} w^d_m \cdot p_m(\mathbf{x}^d \mid \mathbf{V}, \mathbf{a}^d_m, \mathbf{Z}) \qquad (4)$$
with type-specific mixture weights given by $w^d_m$ and type-specific likelihood models by $p_m(\cdot)$. We have kept the dependence on $\mathbf{Z}$ to emphasise that model complexity is carried over from the outer model and not inferred in the inner model; this is the key difference between our method and (Valera and Ghahramani, 2017). The weight $w^d_m$ denotes the probability that model $m$ is responsible for the observations $\mathbf{x}^d \in \mathbf{X}$ of column $d$. The usual provisos hold for this mixture: $\sum_{m\in\mathcal{L}_d} w^d_m = 1$ and all densities $p_m(\cdot)$ are normalised. Accordingly the likelihood factorises as
$$p(\mathbf{X} \mid \mathbf{V}, \{\mathbf{a}^d_m\}_{m\in\mathcal{L}_d}, \mathbf{Z}) = \prod_{d=1}^{D} \sum_{m\in\mathcal{L}_d} w^d_m \cdot p_m(\mathbf{x}^d \mid \mathbf{V}, \mathbf{a}^d_m, \mathbf{Z}). \qquad (5)$$
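The sketch below shows how the per-column mixture of Equations (4)–(5) could be evaluated: each candidate likelihood model scores the whole column and the scores are combined with the weights $w^d_m$ via a log-sum-exp. The two toy densities are illustrative assumptions standing in for the type-specific models $p_m$.

```python
import numpy as np
from scipy import stats

def type_loglik(m, x_d, mean_d):
    """Illustrative stand-ins for the type-specific likelihood models p_m."""
    if m == "real":
        return stats.norm.logpdf(x_d, loc=mean_d, scale=1.0).sum()
    if m == "count":
        rate = np.log1p(np.exp(mean_d))                   # softplus keeps the rate positive
        return stats.poisson.logpmf(np.clip(np.round(x_d), 0, None).astype(int), rate).sum()
    raise ValueError(m)

def column_mixture_loglik(x_d, mean_d, weights):
    """log sum_m w^d_m p_m(x^d | ...), the per-column mixture of Equation (4)."""
    logs = [np.log(w_m) + type_loglik(m, x_d, mean_d) for m, w_m in weights.items()]
    return np.logaddexp.reduce(logs)

x_d = np.random.default_rng(0).poisson(3.0, size=50).astype(float)
print(column_mixture_loglik(x_d, mean_d=np.full(50, 1.0), weights={"real": 0.2, "count": 0.8}))
```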
3. Inference
In the original LG-LFM by Ghahramani and Griffiths (2006), real-valued observations, in conjunction with conjugate likelihoods, are what allow for fast inference algorithms (Valera et al., 2017). Indeed, exchangeable models often yield efficient Gibbs samplers (Williamson et al., 2013). But because we are considering heterogeneous observations, such conjugacy is no longer available. However, using the mechanism described by Valera and Ghahramani (2014), with Gaussian pseudo-observations (see Appendix B for details), we are able to transform the observation space so that it is amenable to efficient MCMC inference.
3.1. Binary latent features
Herein we describe the fundamental steps required for inference over the latent variables in the outer model. Our approach is similar to that of (Valera et al., 2017; Doshi-Velez and Ghahramani, 2009), with the addendum that we condition on $\mathcal{L}$. We use an instance of the sampler proposed by Valera et al. (2017), where the probability of each element in $\mathbf{Z}$ is given by
$$p(z_{nk} = 1 \mid \{\mathbf{y}^d\}_{d=1}^{D}, \mathbf{Z}_{\neg nk}, \mathcal{L}) \propto \frac{m_{\neg nk}}{M} \prod_{d=1}^{D}\prod_{r=1}^{S_d} \int_{\mathbf{b}^d_r} p(y^d_{nr} \mid \mathbf{z}_n, \mathbf{b}^d_r)\, p(\mathbf{b}^d_r \mid \mathbf{y}^d_{\neg nr}, \mathbf{Z}_{\neg n})\, \mathrm{d}\mathbf{b}^d_r \qquad (6)$$
where $S_d$ is the number of columns in the matrices $\mathbf{Y}$ and $\mathbf{B}$ which contain categorical types (note also the explicit conditioning on $\mathcal{L}$); all other types render $S_d = 1$, i.e. when a column is not categorical. Further, $\mathbf{Z}_{\neg n}$ is $\mathbf{Z}$ with the $n$th row removed, and $\mathbf{y}^d_{\neg nr}$ is the $r$th column of matrix $\mathbf{Y}$ ($r = 1$ if the type is not categorical) without element $y^d_{nr}$. Finally, $p(\mathbf{b}^d_r \mid \mathbf{y}^d_{\neg nr}, \mathbf{Z}_{\neg n}) = \mathcal{N}(\mathbf{b}^d_r \mid \mathbf{P}_{\neg n}^{-1}\boldsymbol{\lambda}^d_{\neg nr}, \mathbf{P}_{\neg n}^{-1})$ is the posterior of the feature weight without taking the $n$th observation into account (Valera et al., 2017). Valera et al. (2017) explain that $\mathbf{P}_{\neg n} = \mathbf{Z}_{\neg n}^{\top}\mathbf{Z}_{\neg n} + \sigma_B^{2}\mathbf{I}$ and $\boldsymbol{\lambda}^d_{\neg nr} = \mathbf{Z}_{\neg n}^{\top}\mathbf{y}^d_{\neg nr}$ are the natural parameters of the Gaussian distribution.
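For illustration, the sketch below resamples the entries of one row of $\mathbf{Z}$. It is a simplified, uncollapsed variant of the update: the feature weights $\mathbf{B}$ are held fixed and the Gaussian pseudo-observations $\mathbf{Y}$ are scored directly, whereas Equation (6) integrates the weights out; the IBP-style prior term $m_{\neg nk}$ plays the same role in both versions.

```python
import numpy as np
from scipy import stats

def gibbs_update_row(n, Z, B, Y, sigma_y, rng):
    """Resample z_{nk} for all k with B fixed (a simplified, uncollapsed stand-in for Eq. (6))."""
    N, K = Z.shape
    for k in range(K):
        m_neg = Z[:, k].sum() - Z[n, k]                   # other rows currently using feature k
        log_prior = np.log(np.clip([1.0 - m_neg / N, m_neg / N], 1e-12, None))
        log_lik = np.zeros(2)
        for val in (0, 1):                                # score the pseudo-observations of row n
            z_n = Z[n].copy()
            z_n[k] = val
            log_lik[val] = stats.norm.logpdf(Y[n], loc=z_n @ B, scale=sigma_y).sum()
        logp = log_prior + log_lik
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))      # normalised probability of z_nk = 1
        Z[n, k] = int(rng.random() < p1)
    return Z

rng = np.random.default_rng(0)
Z = rng.binomial(1, 0.5, size=(30, 4))
B = rng.normal(size=(4, 6))
Y = Z @ B + rng.normal(0, 0.1, size=(30, 6))              # synthetic Gaussian pseudo-observations
Z = gibbs_update_row(0, Z, B, Y, sigma_y=0.1, rng=rng)
```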
3.2. Data types
The updates for the inner model are easier since there are no nonparametric mechanisms, and the feature vectors can be efficiently sampled. Again we take our cue from Valera and Ghahramani (2017), but note that in these equations our latent state number $K$ has been inferred from the outer model – see Section 3.1. Now, row $\mathbf{v}_n$ of $\mathbf{V}$ is simulated from $\mathbf{v}_n \sim \mathcal{N}(\boldsymbol{\mu}_{v_n}, \boldsymbol{\Sigma}_{v_n})$, where $\boldsymbol{\Sigma}_{v_n} = \big(\sum_{d=1}^{D}\sum_{m\in\mathcal{L}_d} \mathbf{a}^d_m (\mathbf{a}^d_m)^{\top} + \tfrac{1}{\sigma^2_{v_n}}\mathbf{I}\big)^{-1}$ and $\boldsymbol{\mu}_{v_n} = \boldsymbol{\Sigma}_{v_n}\sum_{d=1}^{D}\sum_{m\in\mathcal{L}_d} \mathbf{a}^d_m\, y^d_{nm}$. Similarly, the feature weight updates have parameters $\boldsymbol{\Sigma}_{b} = \big(\tfrac{1}{\sigma^2_y}\mathbf{V}^{\top}\mathbf{V} + \tfrac{1}{\sigma^2_b}\mathbf{I}\big)^{-1}$ and $\boldsymbol{\mu}^d_m = \boldsymbol{\Sigma}_{b}\sum_{n=1}^{N} \mathbf{v}^{\top}_n\, y^d_{nm}$, which means $\mathbf{a}^d_m \sim \mathcal{N}(\boldsymbol{\mu}^d_m, \boldsymbol{\Sigma}_b)$.
The pseudo-observation updates are as described in (Valera and Ghahramani, 2014, 2017; Valera et al., 2017) and can be found in Appendix B. Finally, the update of the likelihood assignment $s^d_n$ is given by:
$$p(s^d_n = m \mid \mathbf{w}^d, \mathbf{Z}, \mathbf{V}, \mathbf{A}) = \frac{w^d_m\, p_m(x^d_n \mid \mathbf{z}_n, \mathbf{a}^d_m)}{\sum_{m'\in\mathcal{L}_d} w^d_{m'}\, p_{m'}(x^d_n \mid \mathbf{z}_n, \mathbf{a}^d_{m'})} \qquad (7)$$
Finally, to echo Valera and Ghahramani (2017), we place a prior on the likelihood weight vector $\mathbf{w}^d$ for dimension $d$, using a Dirichlet distribution parametrised by $\{\alpha_m\}_{m\in\mathcal{L}_d}$. Then, using the likelihood assignments in Equation (7), we exploit conjugacy to make parameter updates.
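A minimal sketch of these last two updates: each observation in column $d$ is assigned a likelihood model with the probabilities of Equation (7), and the Dirichlet prior over $\mathbf{w}^d$ is then refreshed by counting assignments (conjugacy). The per-type log-densities are passed in precomputed; their exact form is left abstract here.

```python
import numpy as np

def resample_types_and_weights(loglik_nd, w_d, alpha, rng):
    """loglik_nd: (N, J) array of log p_m(x_n^d | z_n, a^d_m) for the J candidate types of column d.
    Returns sampled assignments s^d_n (Equation (7)) and a new weight vector w^d drawn from
    the conjugate Dirichlet posterior parametrised by alpha + counts."""
    logp = np.log(w_d) + loglik_nd                        # unnormalised log of Equation (7)
    logp -= logp.max(axis=1, keepdims=True)               # stabilise before exponentiating
    p = np.exp(logp)
    p /= p.sum(axis=1, keepdims=True)
    s = np.array([rng.choice(len(w_d), p=row) for row in p])   # likelihood assignments s^d_n
    counts = np.bincount(s, minlength=len(w_d))
    w_new = rng.dirichlet(alpha + counts)                 # conjugate Dirichlet update of w^d
    return s, w_new

rng = np.random.default_rng(0)
loglik = rng.normal(size=(100, 3))                        # placeholder per-type log-likelihoods
s, w = resample_types_and_weights(loglik, w_d=np.ones(3) / 3, alpha=np.full(3, 4.0), rng=rng)
print(w)
```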
Figure 3: Results of running the NLFM on the red wine dataset (Dheeru and Karra Taniskidou, 2017) using experimental set 1. Shown are the results from inference over both the discrete variables in the dataset: (a) Cultivar (NLFM), (b) Cultivar (GLTM), (c) #Phenols (NLFM), (d) #Phenols (GLTM). Each plot shows the number of latent features [$K$] and the likelihood weights [$w^d$] over the categorical, ordinal and count models against the number of nested loops. The dashed line in each plot shows the expected parameter value.
4. Experiments
We apply our method to an established dataset of red wines (Dheeru and Karra Taniskidou, 2017). The red wine dataset results from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The dataset has 13 attributes, mixed continuous and discrete. The results in Figure 3 demonstrate that $K$ has an effect on type, since the GLTM and the NLFM predict different discrete types for the total number of phenols: the NLFM gets it right, whereas the GLTM does not. Additional results and experimental details are found in Appendix D.
5. Discussion and Conclusion
In this paper we have presented a general model suitable for the analysis of heterogeneous data: the NLFM. The NLFM combines the benefits of type inference and latent variable modelling, enabling as few inference resources as possible to be used. We overcome the limitations of previous approaches by learning column types directly from the data. We have shown in the experimental section that our inference is able to detect column types cheaply with respect to inference effort.
We envisage many interesting avenues for future research, such as incorporating the restricted IBP (Williamson et al., 2013) to allow the NLFM to select more pertinent latent features based on prior knowledge – indeed, as the name implies, by restricting the inference process. This can have a great impact towards the goal of automated ingestion, and in the wider context of AutoML, particularly as much of today's data is heterogeneous, being collected from many different sources and sensors. Since our inference power is limited, methods are needed that can accommodate advanced data exploration whilst remaining economical with computational resources.
Acknowledgments
Thanks to Isabel Valera and Melanie Pradier for consistently answering our (many) ques-
tions on the GLFM and the GLTM.
References
Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL
http://archive.ics.uci.edu/ml.
Finale Doshi, Kurt Miller, Jurgen Van Gael, and Yee Whye Teh. Variational inference for
the Indian Buffet Process. In Artificial Intelligence and Statistics, pages 137–144, 2009.
Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet
process. In Proceedings of the 26th annual international conference on machine learning,
pages 273–280. ACM, 2009.
Zoubin Ghahramani and Thomas L Griffiths. Infinite latent feature models and the Indian
buffet process. In Advances in neural information processing systems, pages 475–482,
2006.
Joseph M Hellerstein. Quantitative data cleaning for large databases. United Nations
Economic Commission for Europe (UNECE), 2008.
José Miguel Hernández-Lobato, James Robert Lloyd, Daniel Hernández-Lobato, and Zoubin Ghahramani. Learning the semantics of discrete random variables: Ordinal or categorical. In NIPS Workshop on Learning Semantics, 2014.
Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham,
Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, and Paolo
Buono. Research directions in data wrangling: Visualizations and transformations for
usable and credible data. Information Visualization, 10(4):271–288, 2011.
Francisco Ruiz, Isabel Valera, Carlos Blanco, and Fernando Perez-Cruz. Bayesian non-
parametric modeling of suicide attempts. In Advances in neural information processing
systems, pages 1853–1861, 2012.
Francisco JR Ruiz, Isabel Valera, Carlos Blanco, and Fernando Perez-Cruz. Bayesian non-
parametric comorbidity analysis of psychiatric disorders. Journal of Machine Learning
Research, 15(1):1215–1247, 2014.
I. Valera, M. F. Pradier, M. Lomeli, and Z. Ghahramani. General Latent Feature Models
for Heterogeneous Datasets. ArXiv e-prints, June 2017.
Isabel Valera and Zoubin Ghahramani. General table completion using a Bayesian nonpara-
metric model. In Advances in Neural Information Processing Systems, pages 981–989,
2014.
Isabel Valera and Zoubin Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In International Conference on Machine Learning (ICML), Sydney, 2017.
Sinead A Williamson, Steve N MacEachern, and Eric P Xing. Restricting exchangeable
nonparametric distributions. In Advances in Neural Information Processing Systems,
pages 2598–2606, 2013.
Appendix A. Mapping Functions
The function $f$ that maps the pseudo-observation $y^d_n$ into the observation $x^d_n$ depends on the data type. We summarise below the different choices that we have made for continuous variables (real-valued and positive real-valued) and for discrete variables (categorical, ordinal and count data). We use the same mapping functions as Valera and Ghahramani (2014).
Positive real-valued data
We assume $x = f(y) = \log(\exp(wy + \mu) + 1)$.

Categorical data
We assume $x = f(y) = \arg\max_{r\in\{1,\ldots,R_d\}} y_r$, with $y^d_{nr} \sim \mathcal{N}(\mathbf{z}_n\mathbf{b}^d_r, \sigma^2_y)$ and $u \sim \mathcal{N}(0, \sigma^2_y)$. The probability is
$$p(x^d_n = r \mid \mathbf{z}_n, \mathbf{b}^d) = \mathbb{E}_{p(u)}\Bigg[\prod_{j=1,\, j\neq r}^{R_d} \Phi\big(u + \mathbf{z}_n(\mathbf{b}^d_r - \mathbf{b}^d_j)\big)\Bigg] \qquad (8)$$

Ordinal data
We assume
$$x^d_n = f(y^d_n) = \begin{cases} 1 & \text{if } y^d_n \leq \theta^d_1 \\ 2 & \text{if } \theta^d_1 < y^d_n \leq \theta^d_2 \\ \vdots & \\ R_d & \text{if } \theta^d_{R_d - 1} < y^d_n \end{cases}$$
The probability is
$$p(x^d_n = r \mid \mathbf{z}_n, \mathbf{b}^d) = \Phi\left(\frac{\theta^d_r - \mathbf{z}_n\mathbf{b}^d}{\sigma_y}\right) - \Phi\left(\frac{\theta^d_{r-1} - \mathbf{z}_n\mathbf{b}^d}{\sigma_y}\right) \qquad (9)$$

Count data
We assume
$$x^d_n = f(y^d_n) = \operatorname{floor}\big(g(y^d_n)\big) = \operatorname{floor}\big(\log(1 + \exp(w y^d_n))\big). \qquad (10)$$
The probability is
$$p(x^d_n \mid \mathbf{z}_n, \mathbf{b}^d) = \Phi\left(\frac{f^{-1}(x^d_n + 1) - \mathbf{z}_n\mathbf{b}^d}{\sigma_y}\right) - \Phi\left(\frac{f^{-1}(x^d_n) - \mathbf{z}_n\mathbf{b}^d}{\sigma_y}\right) \qquad (11)$$
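The mapping functions above are straightforward to code; the sketch below implements the positive-real, categorical (argmax), ordinal (thresholding) and count maps with illustrative values for $w$, $\mu$ and the thresholds $\theta$, following the definitions restated in this appendix.

```python
import numpy as np

def f_positive(y, w=1.0, mu=0.0):
    """Positive real-valued data: x = log(exp(w*y + mu) + 1), a softplus onto R+."""
    return np.log1p(np.exp(w * y + mu))

def f_categorical(y_r):
    """Categorical data: x = argmax_r y_r over the R_d pseudo-observations of the column."""
    return int(np.argmax(y_r)) + 1                  # categories indexed 1..R_d

def f_ordinal(y, thetas):
    """Ordinal data: x = r such that theta_{r-1} < y <= theta_r (thetas sorted, length R_d - 1)."""
    return int(np.searchsorted(thetas, y)) + 1

def f_count(y, w=1.0):
    """Count data: x = floor(log(1 + exp(w*y)))."""
    return int(np.floor(np.log1p(np.exp(w * y))))

print(f_positive(0.3),
      f_categorical(np.array([0.1, 2.0, -1.0])),    # -> 2
      f_ordinal(0.7, thetas=[-1.0, 0.0, 1.0]),      # -> 3
      f_count(2.5))                                 # -> 2
```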
Appendix B. Sampling pseudo-observations
At the core of our method is the innovation by Valera and Ghahramani (2014) which enables $\mathbf{X}$ to contain the multiple types found in Figure 1, as well as efficient inference (Valera and Ghahramani, 2017). In that model representation, for each observation $x^d_n$ an auxiliary Gaussian variable $y^d_n$ is introduced. In and of itself, this need stems from the requirement to develop efficient inference algorithms, as conjugate priors do not hold in the heterogeneous case (Valera et al., 2017).
Consequently there exists a 'pseudo-observation' $y^d_n$ under the assumption that there exists a transformation $f_m: x^d_n \mapsto y^d_n$, from the heterogeneously observed space (where $x^d_n$ lives) to the pseudo-observed space. The major difference between these spaces is that the latter is a simulated Gaussian space. Under this construction (Valera and Ghahramani, 2014), the real line $\mathbb{R}$ is mapped to the support $\Omega_m$ of the $d$th attribute in $\mathbf{X}$. Further, pseudo-observations are simulated as $y^d_n \sim \mathcal{N}(\mathbf{z}_n\mathbf{b}^d, \sigma^d_y)$, and when the latent variables are conditioned on the pseudo-observations, the latent variable model behaves as a standard conjugate Gaussian model, with the associated efficient inference (Valera and Ghahramani, 2017, 2014). This means that $\mathbf{Z}$ efficiently yields to the inference procedure.
We have discussed pseudo-observations in the context of the outer model; now we turn to the inner model. Following Valera and Ghahramani (2017), we introduce a latent variable $s^d_n$ which indicates the likelihood model of observation $x^d_n$, s.t. $s^d_n \sim \text{Multinomial}(\mathbf{w}^d)$. Hence, the transformation and pseudo-observation are found as
$$x^d_n = f_{s^d_n}\big(y^d_{n s^d_n} + u^d_n\big) \quad \text{where} \quad y^d_n \sim \mathcal{N}(\mathbf{v}_n\mathbf{a}^d, \sigma^d_y) \qquad (12)$$
where $u^d_n \sim \mathcal{N}(0, \sigma^2_u)$ is a noise variable (Valera and Ghahramani, 2017), such that the support of $f_{s^d_n}$ is $\Omega_m$ for likelihood model $m$. As an example, consider the transformation which maps the Gaussian variable to the support $\Omega_m$ when $x^d_n$ is a positive real-valued observation. We could use the transformation $x^d_n = f_{\mathbb{R}^+}(y^d_n + u^d_n)$, wherein $f_{\mathbb{R}^+}(\cdot)$ is a monotonic differentiable function (Valera and Ghahramani, 2014). Using this conversion the likelihood function for $x^d_n \in \mathbb{R}^+$ is given by
$$p(x^d_n \mid \mathbf{z}_n, \mathbf{b}^d) = \frac{1}{A}\exp\left\{-\frac{1}{2(\sigma^2_y + \sigma^2_u)}\big(f^{-1}_{\mathbb{R}^+}(x^d_n) - \mathbf{z}_n\mathbf{b}^d\big)^2\right\}\frac{\mathrm{d}}{\mathrm{d}x^d_n} f^{-1}_{\mathbb{R}^+}(x^d_n) \qquad (13)$$
where $A = \sqrt{2\pi(\sigma^2_y + \sigma^2_u)}$ and the inverse is $f^{-1}_{\mathbb{R}^+}: \mathbb{R}^+ \mapsto \mathbb{R}$. The full set of transformations is given in Appendix A, and an excellent synopsis of this idea is found in (Valera and Ghahramani, 2014). We transform our raw observations $\mathbf{X}$ and store the pseudo-observations in $\mathbf{Y}$; dimensions of $\mathbf{Y}$ can contain matrices when they are found to have a categorical likelihood model.
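To make Equation (13) concrete, the following is a small sketch of the positive-real case using the softplus map from Appendix A; the parameter values are illustrative and the Jacobian term is the derivative of $f^{-1}_{\mathbb{R}^+}$.

```python
import numpy as np

def f_pos_inv(x, w=1.0):
    """Inverse of the softplus map f(y) = log(1 + exp(w*y)): y = log(exp(x) - 1) / w."""
    return np.log(np.expm1(x)) / w

def loglik_pos_real(x, zb, sigma_y, sigma_u, w=1.0):
    """log p(x_n^d | z_n, b^d) for positive real-valued data, i.e. Equation (13),
    including the change-of-variables term d/dx f^{-1}(x)."""
    var = sigma_y**2 + sigma_u**2
    resid = f_pos_inv(x, w) - zb
    jacobian = 1.0 / (w * (1.0 - np.exp(-x)))             # derivative of f^{-1} w.r.t. x
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * resid**2 / var + np.log(jacobian)

print(loglik_pos_real(x=2.0, zb=0.5, sigma_y=1.0, sigma_u=0.5))
```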
B.1. Posterior distributions for pseudo-observations
$\mathbf{z}_n \sim \mathcal{N}(\boldsymbol{\mu}_{z_n}, \boldsymbol{\Sigma}_z)$, with $\boldsymbol{\mu}_{z_n} = \boldsymbol{\Sigma}_z \sum_{d}\sum_{l\in\mathcal{L}_d} \mathbf{b}^d_l\, y^d_{nl}$ and $\boldsymbol{\Sigma}_z = \big(\sum_{d}\sum_{l\in\mathcal{L}_d} \mathbf{b}^d_l (\mathbf{b}^d_l)^{\top} + \sigma^{-1}_z\mathbf{I}\big)^{-1}$

$\mathbf{b}^d_l \sim \mathcal{N}(\boldsymbol{\mu}^d_l, \boldsymbol{\Sigma}_b)$, with $\boldsymbol{\mu}^d_l = \boldsymbol{\Sigma}_b \sum_{n}^{N} \mathbf{z}^{\top}_n\, y^d_{nl}$ and $\boldsymbol{\Sigma}_b = \big(\sigma^{-2}_y \mathbf{Z}^{\top}\mathbf{Z} + \sigma^{-2}_b\mathbf{I}\big)^{-1}$

Following Valera and Ghahramani (2017), we derive the posterior distributions for the pseudo-observations for each variable type.

Continuous data
$$p(y^d_{nl} \mid x^d_n, \mathbf{z}_n, \mathbf{b}^d, s^d_n = l) = \mathcal{N}(y^d_n \mid \hat{\mu}_y, \hat{\sigma}^2_y) \qquad (14)$$
with $\hat{\mu}_y = \left(\frac{\mathbf{z}_n\mathbf{b}^d_l}{\sigma^2_y} + \frac{f^{-1}_l(x^d_n)}{\sigma^2_u}\right)\hat{\sigma}^2_y$ and $\hat{\sigma}^2_y = \left(\frac{1}{\sigma^2_y} + \frac{1}{\sigma^2_u}\right)^{-1}$.

Categorical data
$$p(y^d_{nr} \mid x^d_n = T, \mathbf{z}_n, \mathbf{b}^d, s^d_n = \text{cat}) = \begin{cases} \mathcal{TN}\big(y^d_n \mid \mathbf{z}_n\mathbf{b}^d_r, \sigma^2_y, \max_{r'\neq r}(y^d_{nr'}), \infty\big) & r = T \\ \mathcal{TN}\big(y^d_n \mid \mathbf{z}_n\mathbf{b}^d_r, \sigma^2_y, -\infty, y^d_{nT}\big) & r \neq T \end{cases} \qquad (15)$$
where $\mathcal{TN}$ denotes a truncated normal distribution.

Ordinal data
$$p(y^d_n \mid x^d_n = r, \mathbf{z}_n, \mathbf{b}^d, s^d_n = \text{ord}) = \mathcal{TN}\big(y^d_n \mid \mathbf{z}_n\mathbf{b}^d, \sigma^2_y, \theta^d_{r-1}, \theta^d_r\big) \qquad (16)$$
with $p(\theta^d_r \mid y^d_n) = \mathcal{TN}(\theta^d_r \mid 0, \sigma^2_{\theta}, \theta_{\min}, \theta_{\max})$, where
$$\theta_{\min} = \max\Big(\theta^d_{r-1}, \max_n\{y^d_n \mid x^d_n = r\}\Big), \qquad \theta_{\max} = \min\Big(\theta^d_{r+1}, \min_n\{y^d_n \mid x^d_n = r + 1\}\Big),$$
and $\theta_1$ is fixed.

Count data
$$p(y^d_n \mid x^d_n, \mathbf{z}_n, \mathbf{b}^d, s^d_n = \text{count}) = \mathcal{TN}\big(y^d_n \mid \mathbf{z}_n\mathbf{b}^d, \sigma^2_y, g^{-1}(x^d_n), g^{-1}(x^d_n + 1)\big) \qquad (17)$$
with $g$ defined in Equation (10).
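As an illustration of the posteriors above, ordinal and count pseudo-observations are resampled from truncated normals. The sketch below uses scipy's truncnorm, which is parametrised by standardised bounds; the mean, $\sigma_y$ and thresholds used are illustrative values only.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_normal(mean, sigma, lower, upper, rng):
    """Draw y ~ TN(mean, sigma^2, lower, upper); scipy expects bounds standardised by sigma."""
    a, b = (lower - mean) / sigma, (upper - mean) / sigma
    return truncnorm.rvs(a, b, loc=mean, scale=sigma, random_state=rng)

rng = np.random.default_rng(0)
# Ordinal pseudo-observation for x_n^d = r: the bounds are the thresholds theta_{r-1}, theta_r
y_ord = sample_truncated_normal(mean=0.4, sigma=1.0, lower=0.0, upper=1.0, rng=rng)
# Count pseudo-observation for x_n^d = 3: the bounds are g^{-1}(3) and g^{-1}(4)
g_inv = lambda x: np.log(np.expm1(x))                     # inverse of the softplus g
y_cnt = sample_truncated_normal(mean=2.0, sigma=1.0, lower=g_inv(3.0), upper=g_inv(4.0), rng=rng)
print(y_ord, y_cnt)
```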
Appendix C. Experimental setup
C.1. Synthetic data
Age | Available credit | Civil status | Credit score | Eats pasta | Gender | Salary
21  | 2200.0 | MARRIED | –3.2 | never     | M | 3000.0
21  |  100.0 | SINGLE  | –1.1 | never     | M | 1200.0
19  | 2200.0 | SINGLE  |  9.7 | never     | F | 1800.0
30  | 1100.0 | SINGLE  |  4.4 | sometimes | M | 3040.5
21  | 2000.0 | MARRIED |  2.3 | often     | F | 1100.0
21  |  100.0 | SINGLE  |  0.2 | usually   | F | 1000.0
19  | 6000.0 | WIDOW   | –5.3 | always    | F |  900.0

Table 1: Heterogeneous synthetic data used for testing the NLFM.
In Figure 4 we show the output of our model on two of the seven features present in the data. We conducted two sets of experiments to ascertain the utility of our method. In the first instance, we ran five nested loops with 20 iterations of the inner model and 10 for the outer (a total of 150 MCMC runs). We initialised the run with $K = 5$. Secondly, we wanted to investigate how our model performed when we increased the major parameters of the algorithm. To do this we ran 10 nested loops, with 50 iterations of the inner model and 10 for the outer, resulting in a total of 600 MCMC iterations. Note that although our algorithm and model are comparatively complex, Valera and Ghahramani (2017) run their model multiple times to find a $K$ that is suitable for differentiating different likelihood models. Furthermore, they use 5,000 MCMC runs for all their experiments.
Figure 4: Latent features and likelihood weights of the NLFM on two features ('Civil status' and 'Salary') of the synthetic dataset in Table 1. Panels: (a) 'Civil status' and (b) 'Salary' for the first experimental setting; (c) 'Civil status' and (d) 'Salary' for the second. Each plot shows the number of latent features [$K$] and the likelihood weights [$w$] (categorical/ordinal/count for 'Civil status', real/positive real for 'Salary') against the number of nested loops. In the first row, the model was run with 5 nested loops, 20 inner model iterations and 10 outer model iterations, while in the second row it was run with 10 nested loops, 50 iterations of the inner model and 10 iterations of the outer model.
The model detects the correct type for every variable in the dataset. Figure 4 shows that the types of the variables 'Civil status' and 'Salary' are identified as categorical and real, respectively, with a clear separation between model likelihoods. The magnitude of $K$ has more impact when fewer iterations are used, as shown in the first row, where performance improves significantly once the value $K = 3$ is found, reflected in both attributes (see Figure 4(a) and Figure 4(b)). Even when the value is less significant, however, it is important to note that $K$ need not be fixed a priori, as it is learnt by the model. Apart from passing $\mathcal{L}$ and $K$ between the inner and outer models, all runs are independent.
Appendix D. Additional results
We ran two separate runs to ascertain whether results were one-offs. Once again all types are correctly identified (two of which are shown in Figure 6), using limited inference resources to tackle the problem (see Table 2).
# | NLFM [GLFM] | NLFM [GLTM] | GLTM | K [GLTM] | # Nested loops | α
1 | 100 | 250 | 250 | 10 | 20 | 4
2 | 250 | 250 | 500 | 15 | 20 | 4

Table 2: Experimental parameters used for the experiments. The first column (from the left) is the set name; the second and third note how many MCMC iterations we used for the NLFM (for its GLFM and GLTM components respectively), where we used an equivalent amount for the GLTM alone. The K column notes how many latent features we fixed when the GLTM was run alone, and the number of nested loops notes how many independent runs we used for each dataset. In total, for the GLTM alone, we ran 5,000 MCMC simulations.
Figure 5: Average test log-likelihood on held-out data from the wine dataset for (a) experimental set 1 and (b) experimental set 2. Each panel plots the number of latent features [$K$] for the NLFM and the GLTM, and the test log-likelihood of the NLFM (GLTM component), the NLFM (GLFM component) and the GLTM, against the number of nested loops.
Figure 6: Results of running the NLFM on the red wine dataset (Dheeru and Karra Taniskidou, 2017). The top row (panels (a) and (b), the output variable) shows the same discrete variable for two separate experiments, with likelihood weights over the categorical, ordinal and count models; the bottom row (panels (c) and (d), proanthocyanins (pH)) shows the same continuous variable for two independent experiments, with weights over the positive real and real models. Each panel plots the latent features [$K$] and likelihood weights [$w^d$] against the number of nested loops.