
JMLR: Workshop and Conference Proceedings 1–13 AutoML 2018 - ICML Workshop

Automatic Type Inference with a Nested Latent Variable Model

Neil Dhir neil.dhir@mindfoundry.ai

Davide Zilli davide.zilli@mindfoundry.ai

Tomasz Rudny tomasz.rudny@mindfoundry.ai

Alessandra Tosi alessandra.tosi@mindfoundry.ai

Mind Foundry Ltd, Ewert House, Ewert Place, Oxford, OX2 7DD, United Kingdom

Abstract

We present the first steps towards a type-inferential general latent feature model: the nested latent feature model (NLFM). The NLFM combines automatic type inference and latent feature modelling for heterogeneous data. By performing inference over types as well as over latent features at runtime, we condition each latent parameter on all the others; consequently, we are able to measure the likelihood of an inferred column type under the current setting of the latent features (and vice versa). In so doing we obtain confidence scores on a) our inferred column type set and b) our latent features under the inferred column type set, all within one model. In addition, we seek to make inference cheaper through efficient model design.

Keywords: Latent feature modelling, type inference, automatic data ingestion

1. Introduction

Automatic machine learning (AutoML) is a concept that has grown in prominence since the inception of powerful machine learning methods. At its core, AutoML seeks to progressively augment human machine learning experts, enabling them to focus on data analysis whilst leaving more mundane tasks, such as data ingestion, to an automatic transformation and identification system.

[Figure 1 (image): examples of discrete data domains (e.g. integer-indexed sets) and continuous data domains (e.g. $x_n \in [0, +\infty)$).]

Figure 1: Different data types.

As the big data drive shows little sign of abating, there has recently been a growing body of work which attempts to automate and accelerate different stages of the pre-processing pipeline (Valera and Ghahramani, 2017). Hellerstein (2008) presents a qualitative monograph on the subject of data cleaning, noting references and resources for automation. Similarly, Kandel et al. (2011) discuss research directions in data wrangling (diagnosing data quality issues and manipulating data into a usable form), whereas Hernández-Lobato et al. (2014) touch upon the subject of type inference, i.e. estimating whether the object attributes belong to a certain likelihood model family (see Figure 1). Whilst it is trivial to construct simple heuristics to distinguish between major types of data (e.g. continuous vs. discrete), it is much harder to reason about sub-types. For example, presented with a univariate data sample such as


{10, 12, 15, 20, 25}, it is difficult to determine whether this is an ordinal attribute (IAAF-recognised common distances for road races) or a numeric categorical attribute (Royal Navy pennant numbers). The former has order, the latter does not.
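To make the gap concrete, the following is a minimal sketch (not from the paper) of the kind of simple heuristic alluded to above: it cheaply separates continuous from discrete data, but is silent on the ordinal-versus-categorical question, which requires a model.

```python
import numpy as np

def crude_type_guess(x):
    """Naive heuristic: separates continuous from discrete samples,
    but cannot tell ordinal from categorical sub-types."""
    x = np.asarray(x, dtype=float)
    if np.allclose(x, np.round(x)):
        return "discrete"    # ordinal? categorical? count? -- unidentifiable here
    return "continuous"      # real or positive real

print(crude_type_guess([10, 12, 15, 20, 25]))   # -> "discrete", nothing more
```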

While a few select methods exist for heterogeneous modelling and type inference, to our knowledge there exists no complete model which can do both. We present the nested latent feature model (NLFM), an unsupervised and nonparametric method that allows the column types to be learned conditioned on the inferred number of latent features, and vice versa.

2. Type inferential general latent feature model

Posit that our observations (data) are stored in a design matrix $X$ of size $N \times D$. We denote object $n$ on the rows as $\mathbf{x}_n \triangleq [x_n^1, x_n^2, \ldots, x_n^d, \ldots, x_n^D]$. In probabilistic low-rank matrix factorisation (Valera and Ghahramani, 2014) we assume that $X$ can be approximated by the product $X \approx ZB + \epsilon$ between a binary feature matrix $Z \in \{0,1\}^{N \times K}$ (though $Z$ can also be e.g. Gaussian distributed) and a weight matrix $B \in \mathbb{R}^{K \times D}$. Here $\epsilon$ is white Gaussian noise, $\epsilon \sim \mathcal{N}(0, \Sigma)$, independent of $Z$ and $B$ and uncorrelated across observations.

Database modelling typically assumes that the column likelihood models (types) $m_d\ \forall d \in \{1, \ldots, D\}$ are known a priori. Valera and Ghahramani (2017) proposed a Bayesian method which accurately discovers statistical data types by imposing a prior over likelihood models, where the possible set of types is a priori unknown, so that the type set becomes a set of sets $\mathcal{L} = \{m \in \mathcal{L}_d \mid d = 1, \ldots, D\}$, where $\mathcal{L}_d \triangleq \{m_j \mid j = 1, \ldots, J\}$ specifies a set of potential likelihood models, of size $|\mathcal{L}_d| = J$, for column $d$. Valera and Ghahramani (2017)'s method (from here on referred to as the general latent type model (GLTM)) concerns finding likelihood weights $\mathbf{w}^d \triangleq [w_1^d, \ldots, w_J^d]$ for each $d$, which capture the probability of type $m_j$ in attribute $\mathbf{x}^d \in X$. We extend that model by combining it with the general latent feature model (GLFM) of Valera et al. (2017).

The model described below is capable of complete¹ automatic heterogeneous latent feature modelling by virtue of automatic type inference. This is achieved, broadly, by wrapping a type inference mechanism (which we refer to as the ‘inner’ model) with a latent feature model (the ‘outer’ model). By interleaving nonparametric inference over the number of latent features $K$, evaluating the inner model at this $K$, and then updating the posterior estimate of the types, we propose a form of global optimisation over $K$ and the data types. Under Bayes’ law we can factorise the relevant likelihoods into two main components:

$$p(Z, B, V, A, \mathcal{L} \mid X) \propto \underbrace{p(X \mid Z, B, \mathcal{L})\, p(Z)\, p(B)}_{\S 2.1} \cdot \underbrace{p(\mathcal{L} \mid V, A, Z)\, p(V)\, p(A)}_{\S 2.2} \cdot p(\mathcal{L}) \tag{1}$$

The first factor describes a binary latent feature model conditioned on heterogeneous data types, discussed in Section 2.1. The second concerns posterior estimates of data types conditioned on binary latent features, discussed in Section 2.2. As standalone models, both the GLFM and the GLTM require specific information which the other possesses ($K$ and $\mathcal{L}$ respectively), but which is not shared between them. This explicit sharing of information is achieved in the NLFM (see Figure 2(c)).

1. By ‘complete’ we mean: data ingestion, latent variable modelling and type prediction.
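The control flow this factorisation implies can be sketched as follows. This is a toy illustration of the nesting only: the data, the stand-in samplers and all names (outer_step, inner_step) are hypothetical placeholders, not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                  # toy data

def outer_step(X, K):
    # Stand-in for IBP-based inference over Z given the current types.
    return rng.integers(0, 2, size=(X.shape[0], K))

def inner_step(X, Z, w):
    # Stand-in for the type sampler; K is fixed by the outer model's Z.
    w = w + np.array([0.10, 0.0, 0.0])        # pretend model 0 fits best
    return w / w.sum()

K, w = 5, np.ones(3) / 3                      # initial K and type weights w^d
for loop in range(10):                        # the "nested loops" of Section 4
    Z = outer_step(X, K)                      # outer: Z (and K) | types
    K = int(max(1, K + rng.integers(-1, 2)))  # toy move on model complexity
    for _ in range(5):
        w = inner_step(X, Z, w)               # inner: types | Z, with K fixed
print(K, np.round(w, 2))
```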


[Figure 2 (image): graphical models of (a) the GLTM, (b) the GLFM, and (c) the NLFM.]

Figure 2: Generative latent feature and type models. In Figure 2(a) the appropriate likelihood model $m \in \mathcal{L}_d$ is sought; in Figure 2(b) the set of likelihood models $\mathcal{L}$ is assumed to be known. Finally, in Figure 2(c) the NLFM performs inference (nonparametrically) over the latent features $Z$ and the weights $\mathbf{w}^d$, as well as over the observation types (i.e. likelihood models $m \in \mathcal{L}$). For clarity and to avoid clutter we have omitted auxiliary parameters.

2.1. Latent feature modelling

The outer model consists of a nonparametric LFM. We seek the posterior distribution of $Z$ and $B$, where independence is assumed a priori between $Z$ and $B$; the likelihood model $p(X \mid Z, B)$ and the feature prior $p(B)$ are determined by the application (Doshi et al., 2009). Our interest is latent feature modelling, so the likelihood can be factorised as

$$p(X \mid Z, B, \mathcal{L}) = \prod_{d=1}^{D} \prod_{n=1}^{N} p\big(x_n^d \mid \mathbf{z}_n, \mathbf{b}^d, m_d\big). \tag{2}$$
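As a concrete (toy) reading of Equation (2), the log-likelihood is a sum over columns and rows, with each column scored under its own likelihood model. The sketch below assumes Gaussian and Poisson stand-ins for the per-type densities; it is illustrative only, not the paper's likelihoods.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, K, D = 50, 3, 2
Z = rng.integers(0, 2, size=(N, K))              # binary latent features
B = rng.normal(size=(K, D))                      # feature weights
X = Z @ B + rng.normal(scale=0.1, size=(N, D))   # toy observations

def column_loglik(x_d, mean_d, m_d):
    """log p(x^d | z, b^d, m_d), summed over rows n."""
    if m_d == "real":
        return stats.norm.logpdf(x_d, loc=mean_d, scale=0.1).sum()
    if m_d == "count":
        rate = np.log1p(np.exp(mean_d))           # positive rate via softplus
        return stats.poisson.logpmf(np.abs(np.round(x_d)).astype(int), rate).sum()
    raise ValueError(f"unknown likelihood model: {m_d}")

types = ["real", "real"]                          # the m_d of Eq. (2)
loglik = sum(column_loglik(X[:, d], (Z @ B)[:, d], types[d]) for d in range(D))
print(loglik)
```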

The density $p(Z)$ requires a flexible prior (Ghahramani and Griffiths, 2006): because $Z$ is an indicator matrix which tells us whether a particular feature in $B$ is present in object $n$, we want to determine $K$ at runtime but also leave it unbounded. Further, when we allow $X$ to include heterogeneous data types, we can capture the latent structure using the general latent feature model (GLFM) (Valera et al., 2017), shown in Figure 2(b). Under this model the type set $\mathcal{L}$ is assumed known. This is one of the drawbacks that we address by integrating automatic type inference, which allows $\mathcal{L}$ to be determined at inference time. We discuss in Section 2.2 how posterior type estimates are produced. The model in Equation (2) extends the LG-LFM in (Ghahramani and Griffiths, 2006), and Valera et al. (2017) extended one of the central tenets of that model, the Indian buffet process (IBP), to account for heterogeneous data while maintaining the model complexity of conjugate models. The IBP places a prior on binary matrices where the number of columns, corresponding to latent features $K$, is potentially infinite and can be inferred from the data; its utility rests on two foundations. First, we can learn the model complexity from the data. Second, Valera et al. (2017) argue that binary-valued latent features have been shown to provide more interpretable results in data exploration than standard real-valued latent feature models (Ruiz et al., 2012, 2014).

Finally, because we have an estimate of $\mathcal{L}$, our posterior estimate of $X$ incorporates our uncertainty about the type itself. This is important because it may provide further insight into the generative process of a particular column, e.g. regarding its ordinal (or not) nature. As we saw in the example in Section 1, this is not a trivial exercise, and


one which warrants automation for large datasets wherein data exploration is required or

desired.

2.2. Type inference

The inner part of the NLFM is also an LFM, but a parametric one (see Figure 2(a)). As demonstrated by Valera and Ghahramani (2017), this type of model requires the number of latent features $K$ to be set a priori. However, type prediction is intimately linked to the size of $K$; consequently, joint inference over both type and $K$ is warranted. We will demonstrate how information is passed from the outer model to the inner model, and vice versa, to elicit efficient inference over latent features and latent types. As before, we seek a low-rank representation of our observations

$$p(X \mid V, A, Z) = \prod_{d=1}^{D} \prod_{n=1}^{N} p\big(x_n^d \mid \mathbf{v}_n, \mathbf{a}^d, \mathbf{z}_n\big), \tag{3}$$

where we are now investigating the right-hand part of Equation (1). Similarly to before, the latent features per row are captured by $\mathbf{v}_n$ and stored in $V$, and the feature weights by $\mathbf{a}^d$, stored in the matrix $A$. Herein the latent parameters that need inferring are the likelihood models $\{m_d \mid d = 1, \ldots, D\}$. By explicitly conditioning on the binary feature matrix from the outer model, we do not need to perform inference over the model complexity, as we posit that this information is implicitly contained in $Z$, on which we already have a posterior estimate from the outer model. Hence $K$ is fixed from the outset but, crucially, it is an inferred parameter, unlike in (Valera and Ghahramani, 2017).

Our interest lies in capturing the generative distribution of each attribute $\mathbf{x}^d \in X$. We can entertain this problem using the main idea from (Valera and Ghahramani, 2017), where we assume that the likelihood model of $\mathbf{x}^d \in X$ is a mixture of likelihood models

$$p\big(\mathbf{x}^d \mid V, \{\mathbf{a}_m^d\}_{m \in \mathcal{L}_d}, Z\big) = \sum_{m \in \mathcal{L}_d} w_m^d \cdot p_m\big(\mathbf{x}^d \mid V, \mathbf{a}_m^d, Z\big) \tag{4}$$

with type-specific mixture weights $w_m^d$ and type-specific likelihood models $p_m(\cdot)$. We have kept the dependence on $Z$ to emphasise that the model complexity is carried over from the outer model and not inferred in the inner model; this is the key difference between our method and (Valera and Ghahramani, 2017). The weight $w_m^d$ denotes the probability that model $m$ is responsible for the observations $\mathbf{x}^d \in X$ in column $d$. The usual provisos hold for this mixture: $\sum_{m \in \mathcal{L}_d} w_m^d = 1$ and all densities $p_m(\cdot)$ are normalised. Accordingly, the likelihood factorises as

$$p\big(X \mid V, \{\mathbf{a}_m^d\}_{m \in \mathcal{L}_d}, Z\big) = \prod_{d=1}^{D} \sum_{m \in \mathcal{L}_d} w_m^d \cdot p_m\big(\mathbf{x}^d \mid V, \mathbf{a}_m^d, Z\big). \tag{5}$$

3. Inference

In the original LG-LFM by Ghahramani and Griffiths (2006), real-valued observations in conjunction with conjugate likelihoods are what allow for fast inference algorithms (Valera et al., 2017). Indeed, exchangeable models often yield efficient Gibbs samplers (Williamson et al., 2013). But because we are considering heterogeneous observations, such conjugacy is no longer available. However, using the mechanism described by Valera and Ghahramani (2014), with Gaussian pseudo-observations (see Appendix B for details), we are able to transform the observation space so that it is amenable to efficient MCMC inference.

3.1. Binary latent features

Herein we describe the fundamental steps required for inference over the latent variables in the outer model. Our approach is similar to that of (Valera et al., 2017; Doshi-Velez and Ghahramani, 2009), with the addendum that we condition on $\mathcal{L}$. We use an instance of the sampler proposed by Valera et al. (2017), where the probability of each element in $Z$ is given by

$$p\big(z_{nk} = 1 \mid \{\mathbf{y}^d\}_{d=1}^{D}, Z_{-nk}, \mathcal{L}\big) \propto \frac{m_{-nk}}{M} \prod_{d=1}^{D} \prod_{r=1}^{S_d} \int_{\mathbf{b}_r^d} p\big(y_{nr}^d \mid \mathbf{z}_n, \mathbf{b}_r^d\big)\, p\big(\mathbf{b}_r^d \mid \mathbf{y}_{-nr}^d, Z_{-n}\big)\, \mathrm{d}\mathbf{b}_r^d \tag{6}$$

where $S_d$ is the number of columns in the matrices $Y$ and $B$ which contain categorical types (note also the explicit conditioning on $\mathcal{L}$). All other types render $S_d = 1$, i.e. when a column is not categorical. Further, $Z_{-n}$ is $Z$ with the $n$th row removed, and $\mathbf{y}_{-nr}^d$ is the $r$th column of the matrix $Y$ ($r = 1$ if the type is not categorical) without the element $y_{nr}^d$. Finally, $p(\mathbf{b}_r^d \mid \mathbf{y}_{-nr}^d, Z_{-n}) = \mathcal{N}(\mathbf{b}_r^d \mid P_{-n}^{-1} \boldsymbol{\lambda}_{-nr}^d, P_{-n}^{-1})$ is the posterior of the feature weights without taking the $n$th observation into account (Valera et al., 2017). Valera et al. (2017) explain that $P_{-n} = Z_{-n}^{\top} Z_{-n} + \sigma_B^{-2} I$ and $\boldsymbol{\lambda}_{-nr}^d = Z_{-n}^{\top} \mathbf{y}_{-nr}^d$ are the natural parameters of the Gaussian distribution.
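For intuition, here is a minimal sketch of the corresponding Bernoulli update in the conjugate linear-Gaussian special case, with $B$ collapsed out as in Ghahramani and Griffiths (2006). The full NLFM update in Equation (6) additionally conditions on $\mathcal{L}$ and handles the categorical pseudo-observation columns; this sketch, with toy data, shows the mechanics only.

```python
import numpy as np

def gibbs_update_znk(X, Z, n, k, sigma_x=0.5, sigma_b=1.0, rng=None):
    """Resample one entry z_nk in a linear-Gaussian LFM with B collapsed out.
    Assumes the feature is not a singleton (m_{-nk} > 0); singletons need a
    separate birth/death move in a full IBP sampler."""
    rng = rng or np.random.default_rng()
    N, D = X.shape
    m_k = Z[:, k].sum() - Z[n, k]                 # feature count excluding row n
    logp = np.empty(2)
    for val in (0, 1):
        Z[n, k] = val
        # Collapsed marginal p(X | Z); dense inverse for clarity, not speed.
        P = Z.T @ Z + (sigma_x**2 / sigma_b**2) * np.eye(Z.shape[1])
        M = np.linalg.inv(P)
        ll = -0.5 / sigma_x**2 * np.trace(X.T @ (np.eye(N) - Z @ M @ Z.T) @ X)
        ll -= 0.5 * D * np.linalg.slogdet(P)[1]
        prior = np.log(m_k / N) if val == 1 else np.log(1.0 - m_k / N)
        logp[val] = ll + prior
    p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # normalised Bernoulli prob.
    Z[n, k] = int(rng.random() < p1)
    return Z

rng = np.random.default_rng(0)
Z_true = rng.integers(0, 2, size=(20, 3))
X = Z_true @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(20, 4))
Z = Z_true.copy()
gibbs_update_znk(X, Z, n=0, k=1, rng=rng)
```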

3.2. Data types

The updates for the inner model are easier since there are no nonparametric mechanisms, and the feature vectors can be sampled efficiently. Again we take our cue from Valera and Ghahramani (2017), but note that in these equations our latent state number $K$ has been inferred by the outer model (see Section 3.1). Now, row $\mathbf{v}_n$ of $V$ is simulated from $\mathbf{v}_n \sim \mathcal{N}(\boldsymbol{\mu}_{v_n}, \Sigma_{v_n})$, where $\Sigma_{v_n} = \big( \sum_{d=1}^{D} \sum_{m \in \mathcal{L}_d} \mathbf{a}_m^d (\mathbf{a}_m^d)^{\top} + \frac{1}{\sigma_{v_n}^2} I \big)^{-1}$ and $\boldsymbol{\mu}_{v_n} = \Sigma_{v_n} \sum_{d=1}^{D} \sum_{m \in \mathcal{L}_d} \mathbf{a}_m^d\, y_{nm}^d$. Similarly, the feature weight updates have parameters $\Sigma_a = \big( \frac{1}{\sigma_y^2} V^{\top} V + \frac{1}{\sigma_a^2} I \big)^{-1}$ and $\boldsymbol{\mu}_m^d = \Sigma_a \sum_{n=1}^{N} \mathbf{v}_n^{\top} y_{nm}^d$, which means $\mathbf{a}_m^d \sim \mathcal{N}(\boldsymbol{\mu}_m^d, \Sigma_a)$.
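These are standard conjugate Gaussian computations; a dense sketch for a single weight vector $\mathbf{a}_m^d$ follows, with toy values, taking $\sigma_y = 1$ in line with the convention of Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 40, 4
V = rng.normal(size=(N, K))                        # latent features v_n (rows)
y = V @ rng.normal(size=K) + 0.5 * rng.normal(size=N)   # pseudo-obs column y^d_m
sigma_y, sigma_a = 1.0, 1.0

# Sigma_a = (V^T V / sigma_y^2 + I / sigma_a^2)^{-1},  mu = Sigma_a V^T y / sigma_y^2
Sigma = np.linalg.inv(V.T @ V / sigma_y**2 + np.eye(K) / sigma_a**2)
mu = Sigma @ (V.T @ y) / sigma_y**2
a_sample = rng.multivariate_normal(mu, Sigma)      # draw a_m^d ~ N(mu, Sigma_a)
print(np.round(mu, 2))
```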

The pseudo-observation updates are as described in (Valera and Ghahramani, 2014, 2017; Valera et al., 2017) and can be found in Appendix B. Finally, the update of the likelihood assignment $s_n^d$ is given by:

$$p\big(s_n^d = m \mid \mathbf{w}^d, Z, V, A\big) = \frac{w_m^d\, p_m\big(x_n^d \mid \mathbf{z}_n, \mathbf{a}_m^d\big)}{\sum_{m' \in \mathcal{L}_d} w_{m'}^d\, p_{m'}\big(x_n^d \mid \mathbf{z}_n, \mathbf{a}_{m'}^d\big)} \tag{7}$$

Finally, to echo Valera and Ghahramani (2017), we place a prior on the likelihood weight vector $\mathbf{w}^d$ for dimension $d$ using a Dirichlet distribution parametrised by $\{\alpha_m\}_{m \in \mathcal{L}_d}$. Then, using the likelihood assignments in Equation (7), we exploit conjugacy to make the parameter updates.
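Putting Equation (7) and the Dirichlet update together, one sweep for a single column's type weights might look as follows. The per-model log-likelihoods here are placeholder numbers; in the NLFM they would come from the current $Z$, $V$ and $A$.

```python
import numpy as np

rng = np.random.default_rng(1)
models = ["categorical", "ordinal", "count"]
alpha = np.ones(len(models))                  # Dirichlet prior {alpha_m}
w = rng.dirichlet(alpha)                      # current weights w^d

# Placeholder per-row log-likelihoods log p_m(x_n^d | z_n, a_m^d), N = 20 rows.
loglik = np.tile([-3.1, -2.0, -5.4], (20, 1))

logp = np.log(w) + loglik                     # unnormalised Eq. (7), in logs
p = np.exp(logp - logp.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
s = np.array([rng.choice(len(models), p=row) for row in p])  # draws of s_n^d

counts = np.bincount(s, minlength=len(models))
w_new = rng.dirichlet(alpha + counts)         # conjugate Dirichlet update
print(dict(zip(models, np.round(w_new, 2))))
```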


[Figure 3 (image): four panels, (a) Cultivar (NLFM), (b) Cultivar (GLTM), (c) #Phenols (NLFM), (d) #Phenols (GLTM), each plotting the latent features $K$ and the likelihood weights $w^d$ (categorical, ordinal, count) against the number of nested loops.]

Figure 3: Results of running the NLFM on the red wine dataset (Dheeru and Karra Taniskidou, 2017) using experimental set 1. Shown are the results from inference over both of the discrete variables in the dataset. The dashed line (---) in each plot shows the expected parameter value.

4. Experiments

We apply our method to an established dataset of red wines (Dheeru and Karra Taniskidou, 2017). The red wine dataset results from the chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The dataset has 13 attributes, of mixed continuous and discrete types. The results in Figure 3 demonstrate that $K$ has an effect on type, since the GLTM and the NLFM predict different discrete types for the total number of phenols. The NLFM gets it right, whereas the GLTM does not. Additional results and experimental details are found in Appendix D.

5. Discussion and Conclusion

In this paper we have presented a general model suitable for the analysis of heterogeneous data: the NLFM. The NLFM combines the benefits of type inference and latent variable modelling, enabling as few inference resources to be used as possible. We overcome the limitations of previous approaches by learning the column types directly from the data. We have shown in the experimental section that our inference is able to detect column types cheaply with respect to inference resources.


We envisage many interesting avenues for future research, such as incorporating the restricted IBP (Williamson et al., 2013) to allow the NLFM to select more pertinent latent features based on prior knowledge – indeed, as the name implies, by restricting the inference process. This can have a great impact towards the goal of automated ingestion, and in the wider context of AutoML, particularly as much of today's data is heterogeneous, being collected from many different sources and sensors. Since our inference power is limited, methods are needed that can accommodate advanced data exploration whilst remaining economical with computational resources.

Acknowledgments

Thanks to Isabel Valera and Melanie Pradier for consistently answering our (many) questions on the GLFM and the GLTM.

References

Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Finale Doshi, Kurt Miller, Jurgen Van Gael, and Yee Whye Teh. Variational inference for the Indian buffet process. In Artificial Intelligence and Statistics, pages 137–144, 2009.

Finale Doshi-Velez and Zoubin Ghahramani. Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 273–280. ACM, 2009.

Zoubin Ghahramani and Thomas L. Griffiths. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pages 475–482, 2006.

Joseph M Hellerstein. Quantitative data cleaning for large databases. United Nations

Economic Commission for Europe (UNECE), 2008.

José Miguel Hernández-Lobato, James Robert Lloyd, Daniel Hernández-Lobato, and Zoubin Ghahramani. Learning the semantics of discrete random variables: Ordinal or categorical. In NIPS Workshop on Learning Semantics, 2014.

Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck, and Paolo Buono. Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 10(4):271–288, 2011.

Francisco Ruiz, Isabel Valera, Carlos Blanco, and Fernando Perez-Cruz. Bayesian nonparametric modeling of suicide attempts. In Advances in Neural Information Processing Systems, pages 1853–1861, 2012.

Francisco J. R. Ruiz, Isabel Valera, Carlos Blanco, and Fernando Perez-Cruz. Bayesian nonparametric comorbidity analysis of psychiatric disorders. Journal of Machine Learning Research, 15(1):1215–1247, 2014.


I. Valera, M. F. Pradier, M. Lomeli, and Z. Ghahramani. General Latent Feature Models

for Heterogeneous Datasets. ArXiv e-prints, June 2017.

Isabel Valera and Zoubin Ghahramani. General table completion using a Bayesian nonpara-

metric model. In Advances in Neural Information Processing Systems, pages 981–989,

2014.

Isabel Valera and Zoubin Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In International Conference on Machine Learning (ICML), Sydney, 2017.

Sinead A Williamson, Steve N MacEachern, and Eric P Xing. Restricting exchangeable

nonparametric distributions. In Advances in Neural Information Processing Systems,

pages 2598–2606, 2013.


Appendix A. Mapping Functions

The function $f$ that maps the pseudo-observation $y_n^d$ onto the observation $x_n^d$ depends on the data type. We summarise below the different choices that we have made for continuous variables (real-valued and positive real-valued) and for discrete variables (categorical, ordinal and count data). We use the same mapping functions as Valera and Ghahramani (2014).

Positive real-valued data

We assume $x = f(y) = \log\big(\exp(wy + \mu) + 1\big)$.
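In code, this mapping is a shifted softplus with a numerically stable inverse; $w$ and $\mu$ are the transformation's parameters, given illustrative default values in this sketch.

```python
import numpy as np

def f_pos(y, w=1.0, mu=0.0):
    # f(y) = log(exp(w*y + mu) + 1), computed stably as logaddexp(0, w*y + mu)
    return np.logaddexp(0.0, w * y + mu)

def f_pos_inv(x, w=1.0, mu=0.0):
    # inverse: y = (log(exp(x) - 1) - mu) / w, valid for x > 0
    return (np.log(np.expm1(x)) - mu) / w

y = np.linspace(-3.0, 3.0, 5)
assert np.allclose(f_pos_inv(f_pos(y)), y)
```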

Categorical data

We assume $x = f(\mathbf{y}) = \arg\max_{r \in \{1, \ldots, R_d\}} y_r$, with $y_{nr}^d \sim \mathcal{N}\big(\mathbf{z}_n \mathbf{b}_r^d, \sigma_y^2\big)$ and $u \sim \mathcal{N}\big(0, \sigma_y^2\big)$.

The probability is

$$p\big(x_n^d = r \mid \mathbf{z}_n, \mathbf{b}^d\big) = \mathbb{E}_{p(u)}\left[ \prod_{j=1,\, j \neq r}^{R_d} \Phi\big(u + \mathbf{z}_n \big(\mathbf{b}_r^d - \mathbf{b}_j^d\big)\big) \right] \tag{8}$$

Ordinal data

We assume

$$x_n^d = f\big(y_n^d\big) = \begin{cases} 1 & \text{if } y_n^d \leq \theta_1^d \\ 2 & \text{if } \theta_1^d < y_n^d \leq \theta_2^d \\ \;\vdots & \\ R_d & \text{if } \theta_{R_d - 1}^d < y_n^d \end{cases}$$

The probability is

$$p\big(x_n^d = r \mid \mathbf{z}_n, \mathbf{b}^d\big) = \Phi\left( \frac{\theta_r^d - \mathbf{z}_n \mathbf{b}^d}{\sigma_y} \right) - \Phi\left( \frac{\theta_{r-1}^d - \mathbf{z}_n \mathbf{b}^d}{\sigma_y} \right) \tag{9}$$
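The thresholding in this mapping is exactly a sorted search; a small sketch with illustrative threshold values:

```python
import numpy as np

# x = r iff theta_{r-1} < y <= theta_r (with theta_0 = -inf, theta_{R_d} = +inf)
thetas = np.array([-1.0, 0.5, 2.0])            # illustrative theta_1..theta_3
y = np.array([-2.3, 0.1, 0.7, 3.5])
x = np.searchsorted(thetas, y, side="left") + 1
print(x)                                       # [1 2 3 4]
```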

Count data

We assume

$$x_n^d = f\big(y_n^d\big) = \big\lfloor g\big(y_n^d\big) \big\rfloor = \big\lfloor \log\big(1 + \exp\big(w y_n^d\big)\big) \big\rfloor. \tag{10}$$

The probability is

$$p\big(x_n^d \mid \mathbf{z}_n, \mathbf{b}^d\big) = \Phi\left( \frac{f^{-1}\big(x_n^d + 1\big) - \mathbf{z}_n \mathbf{b}^d}{\sigma_y} \right) - \Phi\left( \frac{f^{-1}\big(x_n^d\big) - \mathbf{z}_n \mathbf{b}^d}{\sigma_y} \right) \tag{11}$$
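Similarly, the count mapping and the inverse $g^{-1}$ used for the integration limits in Equation (11) can be sketched as follows (with $w$ set to an illustrative value):

```python
import numpy as np

def g(y, w=1.0):
    # g(y) = log(1 + exp(w*y))
    return np.logaddexp(0.0, w * y)

def g_inv(x, w=1.0):
    # g^{-1}(x) = log(exp(x) - 1) / w; g_inv(0) = -inf, as required for the
    # lower integration limit of the count x = 0
    with np.errstate(divide="ignore"):
        return np.log(np.expm1(x)) / w

y = np.array([-1.0, 0.3, 2.7])
x = np.floor(g(y)).astype(int)                 # counts: [0, 0, 2]
print(x, g_inv(x.astype(float)), g_inv(x + 1.0))
```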


Appendix B. Sampling pseudo-observations

At the core of our method is the innovation by Valera and Ghahramani (2014) which enables $X$ to contain the multiple types shown in Figure 1, whilst retaining efficient inference (Valera and Ghahramani, 2017). In that model representation, for each observation $x_n^d$ an auxiliary Gaussian variable $y_n^d$ is introduced. This need stems from the requirement to develop efficient inference algorithms, as conjugate priors do not hold in the heterogeneous case (Valera et al., 2017).

Consequently there exists a ‘pseudo-observation’ $y_n^d$ under the assumption that there exists a transformation $f_m : x_n^d \mapsto y_n^d$ from the heterogeneously observed space (where $x_n^d$ lives) to the pseudo-observed space. The major difference between these spaces is that the latter is a simulated Gaussian space. Under this construction (Valera and Ghahramani, 2014), the real line $\mathbb{R}$ is mapped to the support $\Omega_m$ of the $d$th attribute in $X$. Further, pseudo-observations are simulated as $y_n^d \sim \mathcal{N}(\mathbf{z}_n \mathbf{b}^d, \sigma_y^d)$, and when the latent variables are conditioned on the pseudo-observations, the latent variable model behaves as a standard conjugate Gaussian model, with the associated efficient inference (Valera and Ghahramani, 2017, 2014). This means that $Z$ yields efficiently to the inference procedure.

We have discussed pseudo-observations in the context of the outer model; now we turn to the inner model. Following Valera and Ghahramani (2017), we introduce a latent variable $s_n^d$ which indicates the likelihood model of observation $x_n^d$, s.t. $s_n^d \sim \text{Multinomial}(\mathbf{w}^d)$. Hence, the transformation and pseudo-observation are found as

$$x_n^d = f_{s_n^d}\big(y_{n s_n^d}^d + u_n^d\big) \quad \text{where} \quad y_n^d \sim \mathcal{N}\big(\mathbf{v}_n \mathbf{a}^d, \sigma_y^d\big), \tag{12}$$

where $u_n^d \sim \mathcal{N}(0, \sigma_u^2)$ is a noise variable (Valera and Ghahramani, 2017), s.t. the support of $f_{s_n^d}$ is $\Omega_m$ for likelihood model $m$. As an example, consider the transformation which maps the Gaussian variable to the support $\Omega_m$ when $x_n^d$ is a positive real-valued observation. We could use the transformation $x_n^d = f_{\mathbb{R}^+}(y_n^d + u_n^d)$, wherein $f_{\mathbb{R}^+}(\cdot)$ is a monotonic differentiable function (Valera and Ghahramani, 2014). Using this conversion, the likelihood function for $x_n^d \in \mathbb{R}^+$ is given by

$$p\big(x_n^d \mid \mathbf{z}_n, \mathbf{b}^d\big) = \frac{1}{A} \exp\left\{ -\frac{1}{2\big(\sigma_y^2 + \sigma_u^2\big)} \Big( f_{\mathbb{R}^+}^{-1}\big(x_n^d\big) - \mathbf{z}_n \mathbf{b}^d \Big)^2 \right\} \frac{\mathrm{d}}{\mathrm{d} x_n^d} f_{\mathbb{R}^+}^{-1}\big(x_n^d\big), \tag{13}$$

where $A = \sqrt{2\pi\big(\sigma_y^2 + \sigma_u^2\big)}$ and the inverse is $f_{\mathbb{R}^+}^{-1} : \mathbb{R}^+ \mapsto \mathbb{R}$. The full set of transformations is given in Appendix A, and an excellent synopsis of this idea is found in (Valera and Ghahramani, 2014). We transform our raw observations $X$ and store the pseudo-observations in $Y$, where dimensions of $Y$ can contain matrices when they are found to have a categorical likelihood model.

B.1. Posterior distributions for pseudo-observations

$\mathbf{z}_n \sim \mathcal{N}\big(\boldsymbol{\mu}_z, \Sigma_z\big)$, with $\boldsymbol{\mu}_z = \Sigma_z \sum_{d} \sum_{l \in \mathcal{L}_d} \mathbf{b}_l^d\, y_{nl}^d$ and $\Sigma_z = \big( \sum_{d} \sum_{l \in \mathcal{L}_d} \mathbf{b}_l^d (\mathbf{b}_l^d)^{\top} + \sigma_z^{-1} I \big)^{-1}$

$\mathbf{b}_l^d \sim \mathcal{N}\big(\boldsymbol{\mu}_l^d, \Sigma_b\big)$, with $\boldsymbol{\mu}_l^d = \Sigma_b \sum_{n=1}^{N} \mathbf{z}_n^{\top} y_{nl}^d$ and $\Sigma_b = \big( \sigma_y^{-2} Z^{\top} Z + \sigma_b^{-2} I \big)^{-1}$

Following Valera and Ghahramani (2017), we derive the posterior distributions of the pseudo-observations for each variable type.


Continuous data

$$p\big(y_{nl}^d \mid x_n^d, \mathbf{z}_n, \mathbf{b}^d, s_n^d = l\big) = \mathcal{N}\big(y_n^d \mid \hat{\mu}_y, \hat{\sigma}_y^2\big) \tag{14}$$

with $\hat{\mu}_y = \left( \frac{\mathbf{z}_n \mathbf{b}_l^d}{\sigma_y^2} + \frac{f_l^{-1}(x_n^d)}{\sigma_u^2} \right) \hat{\sigma}_y^2$ and $\hat{\sigma}_y^2 = \left( \frac{1}{\sigma_y^2} + \frac{1}{\sigma_u^2} \right)^{-1}$.

Categorical data

$$p\big(y_{nr}^d \mid x_n^d = T, \mathbf{z}_n, \mathbf{b}^d, s_n^d = \mathrm{cat}\big) = \begin{cases} \mathcal{TN}\big(y_n^d \mid \mathbf{z}_n \mathbf{b}_r^d,\ \sigma_y^2,\ \max_{r' \neq r}(y_{nr'}^d),\ \infty\big) & r = T \\ \mathcal{TN}\big(y_n^d \mid \mathbf{z}_n \mathbf{b}_r^d,\ \sigma_y^2,\ -\infty,\ y_{nT}^d\big) & r \neq T \end{cases} \tag{15}$$

where $\mathcal{TN}$ denotes a truncated Normal distribution.
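Sampling from these truncated Normals is routine; a sketch using scipy follows (note that scipy's truncnorm is parameterised by truncation bounds in standard-deviation units, hence the rescaling):

```python
import numpy as np
from scipy import stats

def sample_tn(mu, sigma, lo, hi, rng):
    """Draw from TN(y | mu, sigma^2, lo, hi) as in Eq. (15)."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma   # bounds in sigma units
    return stats.truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng)

rng = np.random.default_rng(2)
# e.g. the r = T branch: truncated below at the current max of the rival columns
print(sample_tn(mu=0.3, sigma=1.0, lo=1.2, hi=np.inf, rng=rng))
```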

Ordinal data

$$p\big(y_n^d \mid x_n^d = r, \mathbf{z}_n, \mathbf{b}^d, s_n^d = \mathrm{ord}\big) = \mathcal{TN}\big(y_n^d \mid \mathbf{z}_n \mathbf{b}^d,\ \sigma_y^2,\ \theta_{r-1}^d,\ \theta_r^d\big) \tag{16}$$

with $p\big(\theta_r^d \mid \mathbf{y}^d\big) = \mathcal{TN}\big(\theta_r^d \mid 0, \sigma_\theta^2, \theta_{\min}, \theta_{\max}\big)$, where

$$\theta_{\min} = \max\Big( \theta_{r-1}^d,\ \max_n \big\{ y_n^d \mid x_n^d = r \big\} \Big), \qquad \theta_{\max} = \min\Big( \theta_{r+1}^d,\ \min_n \big\{ y_n^d \mid x_n^d = r + 1 \big\} \Big),$$

and $\theta_1$ is fixed.

Count data

$$p\big(y_n^d \mid x_n^d, \mathbf{z}_n, \mathbf{b}^d, s_n^d = \mathrm{count}\big) = \mathcal{TN}\Big(y_n^d \mid \mathbf{z}_n \mathbf{b}^d,\ \sigma_y^2,\ g^{-1}\big(x_n^d\big),\ g^{-1}\big(x_n^d + 1\big)\Big) \tag{17}$$

with $g$ defined in Equation (10).

Appendix C. Experimental setup

C.1. Synthetic data

Age   Available credit   Civil status   Credit score   Eats pasta   Gender   Salary
21    2200.0             MARRIED        –3.2           never        M        3000.0
21    100.0              SINGLE         –1.1           never        M        1200.0
19    2200.0             SINGLE         9.7            never        F        1800.0
30    1100.0             SINGLE         4.4            sometimes    M        3040.5
21    2000.0             MARRIED        2.3            often        F        1100.0
21    100.0              SINGLE         0.2            usually      F        1000.0
19    6000.0             WIDOW          –5.3           always       F        900.0

Table 1: Heterogeneous synthetic data used for testing the NLFM.

In Figure 4 we show the output of our model on two of the seven features present in the data. We conducted two sets of experiments to ascertain the utility of our method. In the first instance, we ran five nested loops with 20 iterations of the inner model and 10 of the outer (a total of 150 MCMC runs). We initialised the run with $K = 5$. Secondly, we wanted to investigate how our model performed when we increased the major parameters of the algorithm. To do this we ran 10 nested loops, with 50 iterations of the inner model and 10 of the outer, resulting in a total of 600 MCMC iterations. Note that although our algorithm and model are comparatively complex, Valera and Ghahramani (2017) run their model multiple times to find a $K$ that is suitable for differentiating between likelihood models. Furthermore, they use 5,000 MCMC runs for all their experiments.

[Figure 4 (image): four panels, (a) ‘Civil status’ and (b) ‘Salary’ for the first experimental run, and (c) ‘Civil status’ and (d) ‘Salary’ for the second, each plotting the latent features $K$ and the likelihood weights $w$ against the number of nested loops.]

Figure 4: Latent features and likelihood weights of the NLFM on two features (‘Civil status’ and ‘Salary’) of the synthetic dataset in Table 1. In the first row, the model was run with 5 loops, 20 inner model iterations, and 10 outer model iterations, while in the second row it was run with 10 nested loops, 50 iterations of the inner model and 10 iterations of the outer model.

The model detects the correct type for every variable in the dataset. Figure 4 shows that the types of the variables ‘Civil status’ and ‘Salary’ are identified as categorical and real, respectively, with a clear separation between the model likelihoods. The magnitude of $K$ has more impact with fewer iterations, as shown in the first row, where the performance improves significantly once the value $K = 3$ is found, reflected in both attributes (see Figure 4(a) and Figure 4(b)). Even when the effect is less significant, however, it is important to note that $K$ need not be fixed a priori, as it is learnt by the model. Apart from passing $\mathcal{L}$ and $K$ between the inner and outer models, all runs are independent.


Appendix D. Additional results

We ran two separate runs to ascertain whether the results were one-offs. Once again all types are correctly identified (two of which are shown in Figure 6), using limited inference resources to tackle the problem (see Table 2).

#   NLFM [GLFM]   NLFM [GLTM]   GLTM   K [GLTM]   # Nested loops   α
1   100           250           250    10         20               4
2   250           250           500    15         20               4

Table 2: Experimental parameters. The first column (from the left) is the set name; the second and third note how many MCMC iterations we used for the GLFM and GLTM parts of the NLFM respectively, where we used an equivalent amount for the GLTM alone. The K column notes the number of latent features that was fixed when the GLTM was run alone, and the number of nested loops notes how many independent runs we used for each dataset. In total, for the GLTM alone, we ran 5,000 MCMC simulations.


[Figure 5 (image): two panels, (a) experimental set 1 and (b) experimental set 2, each plotting the latent features $K$ (NLFM and GLTM) and the test log-likelihood of the NLFM (GLTM), NLFM (GLFM) and GLTM against the number of nested loops.]

Figure 5: Average test log-likelihood on held-out data on the wine dataset.


[Figure 6 (image): four panels, (a) and (b) the output variable (likelihood weights over categorical, ordinal and count), and (c) and (d) Proanthocyanins (pH) (likelihood weights over positive real and real), each plotting the latent features $K$ and the likelihood weights $w^d$ against the number of nested loops.]

Figure 6: Results of running the NLFM on the red wine dataset (Dheeru and Karra Taniskidou, 2017). The top row represents the same discrete variable for two separate experiments. The bottom row shows the same continuous variable for two independent experiments.