
Deep Transfer as Structure Learning in Markov Logic Networks

David Moore and Andrea Danyluk

Williams College

Williamstown, MA

{10dam, andrea}@cs.williams.edu

Abstract

Learning the relational structure of a domain is a funda-

mental problem in statistical relational learning. The

deep transfer algorithm of Davis and Domingos at-

tempts to improve structure learning in Markov logic

networks by harnessing the power of transfer learn-

ing, using the second-order structural regularities of a

source domain to bias the structure search process in

a target domain. We propose that the clique-scoring

process which discovers these second-order regularities

constitutes a novel standalone method for learning the

structure of Markov logic networks, and that this fact,

rather than the transfer of structural knowledge across

domains, accounts for much of the performance bene-

fit observed via the deep transfer process. This claim is

supported by experiments in which we find that clique

scoring within a single domain often produces results

equaling or surpassing the performance of deep transfer

incorporating external knowledge, and also by explicit

algorithmic similarities between deep transfer and other

structure learning techniques.

Introduction

The growing field of statistical relational learning aims

to develop methods for learning, inference, and dealing

with uncertainty in domains which are fundamentally relational, in contrast to traditional statistical learning techniques which are often limited to propositional data. Markov

logic networks (MLNs) are a representation which gen-

eralizes first-order logic and probabilistic graphical mod-

els, using weighted formulas of first-order logic to repre-

sent knowledge about a relational domain (Richardson and

Domingos 2006). Learning an MLN representation for a

domain can be reduced to two problems: structure learning,

i.e. discovering a set of logical formulas, and weight learn-

ing, i.e. choosing a weight for each formula. Learning the

structure of an MLN for a relational domain is an impor-

tant problem for which several algorithms have been pro-

posed (for example, Kok and Domingos 2005; Mihalkova

and Mooney 2007; Biba, Ferilli, and Esposito 2008; Kok

and Domingos 2009).

Copyright © 2010, Association for the Advancement of Artificial

Intelligence (www.aaai.org). All rights reserved.

One approach to improving structure learning perfor-

mance is to apply transfer learning, which uses knowl-

edge gained from learning a source task to aid learning in

a related target task (Torrey and Shavlik 2009). The deep

transfer algorithm (DTM) proposed by Davis and Domin-

gos transfers knowledge between relational domains, which

may be composed of entirely different predicates, by ana-

lyzing structural tendencies of formulas in the source do-

main and representing them as domain-independent knowl-

edge in the form of second-order cliques. The cliques are

scored to identify the most salient structural tendencies, and

the highest-scoring cliques are instantiated as first-order for-

mulas in the target domain, where they provide a declara-

tive bias for structure learning. DTM has been shown to de-

liver improved learning performance for several pairings of

source and target domains, relative to pure structure learning

within the target domain.

In this paper, we argue that the performance increases

observed through DTM are not due primarily to its role in

transferring knowledge across domains, but instead that

the algorithm is in fact performing a novel form of struc-

ture learning, which we will refer to as learning through

clique scoring with greedy selection (CSGL). As evidence,

we show that the DTM algorithm can be applied within a

single domain to obtain performance equalling or surpass-

ing the best cases of transfer, but without incorporating ex-

ternal knowledge. Furthermore, although DTM (and thus

CSGL) can make use of the MSL structure learning algo-

rithm (Kok and Domingos 2005) to refine learned theories,

we show that even when this step is omitted, structure learn-

ing through CSGL produces results equalling or exceeding

those of MSL.

We begin by briefly reviewing Markov logic and the deep

transfer algorithm. We then describe the role of clique scor-

ing in deep transfer and explain how, in conjunction with

first-order instantiation and greedy selection, it can be con-

strued as an algorithm for structure learning. We demon-

strate empirically that clique scoring can be used effectively

to learn in a target domain even without a distinct source do-

main, and we evaluate its performance as a standalone struc-

ture learning algorithm. Finally, we discuss the relationship

between CSGL and other structure learning algorithms, and

note in particular its strong parallels with a version of the

hypergraph lifting approach (LHL) presented in (Kok and


Domingos 2009).

Markov Logic and Deep Transfer

Terminology

In first-order logic, a constant refers to a particular object in

the domain. Variables stand in place of constants for the pur-

pose of quantification. A term is a constant or a variable. A

predicate represents a relation on objects in the domain and

is either true or false for any given list of objects; the arity

of a predicate is the number of arguments it takes. An atom

is a predicate applied to a list of terms, and a ground atom

is an atom in which all of the terms are constants. A clause

is a disjunction of atoms, any of which may be negated; a

formula may also include conjunction and implication. A

ground formula is a formula in which all of the atoms are

ground atoms. A world (or “possible world”) is an assign-

ment of truth values to all ground atoms in the domain.

Markov Logic Networks

Markov logic networks (MLNs), introduced by Richardson

and Domingos (2006), are a knowledge representation gen-

eralizing both first-order logic and probabilistic graphical

models. An MLN is a set of first-order formulas with as-

sociated weights. Given a finite set of constants, an MLN

may be represented as a Markov network (Pearl 1988) con-

sisting of a node for each ground atom in the domain and

an edge between any two nodes representing ground atoms

which appear together in the same ground formula. In this

representation, the cliques of the graph correspond roughly

to the ground formulas of the domain, and to each ground

formula $j$ we assign a feature $g_j(x)$, which is defined to be 1 if formula $j$ is true in world $x$ and 0 otherwise. We also assign a weight $w_j$ equal to the weight of the corresponding first-order formula in the MLN. Markov networks model the joint probability distribution of a set of variables, and so the Markov network created from a grounded MLN can be interpreted as defining a probability distribution over the set of all possible worlds $X$, given by

$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_{j} w_j g_j(x)\Big),$$

where the sum is taken over all features in the graph and $Z$ is a normalization constant. This formula can be written equivalently as

$$P(X = x) = \frac{1}{Z} \exp\Big(\sum_{i \in F} w_i n_i(x)\Big),$$

where $F$ is the set of all first-order formulas in the MLN, $w_i$ is the weight of the $i$th formula, and $n_i(x)$ is the number of true groundings of the $i$th formula in possible world $x$; note

that this rendering is in terms of the first-order MLN struc-

ture rather than the underlying Markov network structure.

By generalizing the propositional representation of Markov

networks to include first-order logic, MLNs are capable of

succinctly and intuitively representing complex models in-

volving uncertainty in relational domains.
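As an illustration of the distribution defined above, the following minimal sketch enumerates every possible world of a very small ground MLN and normalizes $\exp(\sum_i w_i n_i(x))$ explicitly. The function name `world_probabilities`, the toy predicate `Smokes`, and the weight 1.0 are assumptions made for this example only; real MLN systems do not enumerate worlds, since their number grows exponentially in the number of ground atoms.

```python
import itertools
import math


def world_probabilities(ground_atoms, formulas):
    """Return {world: P(X = world)} for a tiny ground MLN.

    `formulas` is a list of (weight, count_fn) pairs, where count_fn(world)
    returns n_i(x), the number of true groundings of formula i in that world.
    A world is a tuple of booleans aligned with `ground_atoms`.
    """
    worlds = list(itertools.product([False, True], repeat=len(ground_atoms)))
    # Unnormalized score exp(sum_i w_i * n_i(x)) for every world.
    scores = {
        world: math.exp(sum(weight * count(world) for weight, count in formulas))
        for world in worlds
    }
    z = sum(scores.values())  # normalization constant Z
    return {world: score / z for world, score in scores.items()}


# Hypothetical toy domain: constants {A, B} and the single formula Smokes(x)
# with weight 1.0, so n(x) is simply the number of true Smokes atoms.
atoms = ["Smokes(A)", "Smokes(B)"]
formulas = [(1.0, lambda world: sum(world))]
probs = world_probabilities(atoms, formulas)
```

In this toy case each additional true grounding multiplies a world's unnormalized score by $e$, so the world in which both atoms are true is $e^2$ times as likely as the world in which both are false.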

Several algorithms have been proposed for structure

learning in MLNs. MSL, proposed by Kok and Domin-

gos (2005), uses an inductive logic programming-style beam

search in which atoms to be added or removed from a clause

are evaluated for their effect on the weighted pseudo-log-

likelihood (WPLL, defined by Kok and Domingos 2005)

of the data. As a local search strategy, MSL has the use-

ful property that it can be used both to learn structure from

scratch and also to refine existing theories. More recently,

several algorithms have been proposed which may outper-

form MSL in learning performance, required training time,

or both; see Mihalkova and Mooney (2007), Biba, Ferilli,

and Esposito (2008), and Kok and Domingos (2009).

Deep Transfer

The goal of deep transfer learning, as described by Davis

and Domingos (2009), is to identify domain-independent abstract knowledge which can be used to bias the process of

structure learning in a destination domain. The process be-

gins with an MLN representing a source domain, abstracts

the MLN into second-order cliques which represent second-

order structure, identifies the cliques which correspond to

the most salient structural properties, and then uses those to

bias structure learning of an MLN in the target domain. The

initial source-domain MLN can be generated by any method,

but Davis and Domingos obtained their best results when us-

ing exhaustive search, which considers all possible clauses

under a maximum length and number of object variables.

Second-order cliques are simply sets of second-order predicates which occur together in at least one clause (e.g. the first-order clauses !Friends(x,y) ∨ Friends(y,x) and !Enemies(x,y) ∨ Enemies(y,x) each correspond to the same second-order clique {r(x,y), r(y,x)}). To score a clique, the algorithm examines each of its first-order instantiations. Each instantiation is decomposed into all possible pairs of subcliques, and each pair is assigned a score equal to the K-L divergence

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)},$$

where $p$ is the probability distribution of the entire clique (for any state of the clique $x$, $p(x)$ gives the probability that the clique will be in state $x$ if we instantiate its terms with constants from the domain) and $q$ is the distribution it would have if the two subcliques were independent. The score of a first-order instantiation is the minimum score of all of its decompositions, and the score of the second-order clique is the average of its top n instantiations, which encourages cliques having multiple useful instantiations.

Once the scoring process is complete, DTM selects the top k second-order cliques which have at least one true grounding in the target domain, and instantiates them by creating all corresponding first-order clauses in the target domain. These clauses include all possible ways of flipping the signs of the atoms within the clique. In the “greedy transfer with refinement” approach, which Davis and Domingos found to be the most effective, DTM creates an MLN from the resulting clauses by first greedily selecting clauses to improve the WPLL until no additional clause does so, and then applying MSL to refine the theory suggested by the greedy procedure.
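The clique-scoring rule described in this section can be sketched as follows. This is a minimal illustration, not the DTM implementation: it assumes that each first-order instantiation is summarized by a list of observed clique states (one tuple of truth values per grounding, with at least two atoms per clique), that $p$ is the empirical distribution over those states, and that $q$ is the product of the empirical marginals of the two subcliques. The function names are hypothetical.

```python
import itertools
import math
from collections import Counter


def kl_for_split(states, left):
    """D(p || q) for one decomposition of a clique into two subcliques.

    p is the empirical joint distribution over `states`; q is the product of
    the marginals over the atom positions in `left` and the remaining ones.
    """
    n_atoms = len(states[0])
    right = [i for i in range(n_atoms) if i not in left]
    total = len(states)
    joint = Counter(states)
    marg_left = Counter(tuple(s[i] for i in left) for s in states)
    marg_right = Counter(tuple(s[i] for i in right) for s in states)
    kl = 0.0
    for state, count in joint.items():
        px = count / total
        qx = (marg_left[tuple(state[i] for i in left)] / total) * (
            marg_right[tuple(state[i] for i in right)] / total
        )
        kl += px * math.log(px / qx)  # states with p(x) = 0 contribute nothing
    return kl


def instantiation_score(states):
    """Minimum K-L divergence over all splits into two non-empty subcliques."""
    n_atoms = len(states[0])
    splits = []
    for size in range(1, n_atoms // 2 + 1):
        splits.extend(itertools.combinations(range(n_atoms), size))
    return min(kl_for_split(states, list(split)) for split in splits)


def clique_score(instantiation_scores, n):
    """Average of the n highest-scoring first-order instantiations."""
    top = sorted(instantiation_scores, reverse=True)[:n]
    return sum(top) / len(top)
```

Taking the minimum over decompositions means an instantiation scores highly only if no way of splitting it into two independent parts explains the data well, which is the sense in which the clique captures a regularity not present in its subcliques.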


DTM for Structure Learning

Although DTM can extract cliques from any set of formu-

las in the source domain, Davis and Domingos found that

the best results came from considering not the results of

beam search or other structure learning techniques, but a

simple exhaustive listing of all possible clauses in the do-

main within a maximum length and number of terms. They

argue that since the clique scoring process already suggests

the most useful cliques for transfer, there is no additional

benefit in trying to learn a theory in the source domain which

would restrict the cliques that are considered for transfer. In

this paper we will generally use DTM to refer to the form of

the algorithm in which exhaustive search is used to generate

cliques for transfer.
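The abstraction step that turns first-order clauses into second-order cliques can be illustrated with a short sketch. It assumes a clause is given as a list of (negated, predicate, arguments) triples, replaces each distinct predicate name with a placeholder predicate variable, and discards the signs of the literals. The representation and the name `abstract_clause` are assumptions for this example; in particular, the sketch does not canonicalize the names of object variables, which a full implementation would need to do in order to recognize all equivalent cliques.

```python
def abstract_clause(literals):
    """Map a first-order clause to a second-order clique.

    `literals` is an iterable of (negated, predicate, args) triples. The
    result is a frozenset of (predicate_variable, args) pairs; literal signs
    are dropped because instantiation later tries every sign combination.
    """
    names = "rstuvw"  # placeholder predicate variables, enough for short clauses
    pred_vars = {}
    clique = set()
    for _negated, predicate, args in literals:
        if predicate not in pred_vars:
            pred_vars[predicate] = names[len(pred_vars)]
        clique.add((pred_vars[predicate], tuple(args)))
    return frozenset(clique)


# !Friends(x,y) v Friends(y,x)  and  !Enemies(x,y) v Enemies(y,x)
friends = [(True, "Friends", ("x", "y")), (False, "Friends", ("y", "x"))]
enemies = [(True, "Enemies", ("x", "y")), (False, "Enemies", ("y", "x"))]
same_clique = abstract_clause(friends) == abstract_clause(enemies)
```

Both symmetry clauses map to the clique {r(x,y), r(y,x)}, so `same_clique` is true.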

An interesting effect of the use of exhaustive search is

that unlike many transfer methods, DTM performs a differ-

ent sort of learning in the source domain than in the target

domain. DTM’s only interaction with the data of the source

domain is through the clique-scoring process, while in the

destination domain it uses a greedy selection process, fol-

lowed optionally by beam search, to refine the clauses pro-

duced through transfer. This raises the question of whether

the demonstrated benefits of transfer can be attributed en-

tirely to the use of source domain knowledge, or whether

the clique scoring process might be acting as a novel form

of structure learning in itself, providing a performance boost

independent of the particular structural knowledge being im-

ported from the source domain.

We propose a simple modification to DTM, which we call

“self-transfer”. The inspiration for self-transfer is related to

the persistent question in transfer learning of how to know

when two tasks are sufficiently related that transfer between

them will be worthwhile. For transfer to be at its most effec-

tive, we generally want the source and target tasks to be as

similar as possible. Since no two domains are more similar

to each other than a domain is to itself, we propose a method

that effectively applies DTM to a single domain as both

source and target. We first construct an exhaustive list of all

possible first-order clauses in the domain up to some maxi-

mum length and number of object variables. These clauses

are then abstracted into second-order cliques and scored via

the method described above. The k top-scoring cliques are

then instantiated back into first-order clauses of the domain.

From this set of clauses, we use the “greedy transfer with

refinement” method described by Davis and Domingos to

derive a final set of clauses indicating the structure of the do-

main. That is, we greedily select from the instantiated first-

order clauses until no additional clause improves the WPLL,

and then we refine the domain theory with MSL.
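The greedy selection step can be sketched as below. The scoring function is passed in as a parameter and stands in for the WPLL of the data under the candidate theory (after weight learning), which this sketch does not implement. Choosing the single best clause at each step is an assumption of the sketch; the description above only requires that clauses be added until none improves the score.

```python
def greedy_select(candidates, score):
    """Greedily build a theory from candidate clauses.

    At each step, add the candidate whose inclusion gives the highest
    score(theory); stop when no remaining candidate improves the score.
    """
    theory = []
    best = score(theory)
    remaining = list(candidates)
    while remaining:
        scored = [(score(theory + [clause]), clause) for clause in remaining]
        new_best, choice = max(scored, key=lambda pair: pair[0])
        if new_best <= best:
            break  # no additional clause improves the score
        theory.append(choice)
        remaining.remove(choice)
        best = new_best
    return theory


# Hypothetical stand-in for the WPLL: each clause has a fixed contribution.
toy_gain = {"A": 2.0, "B": 1.0, "C": -0.5}


def toy_score(theory):
    return sum(toy_gain[clause] for clause in theory)


selected = greedy_select(["C", "B", "A"], toy_score)
```

With the toy score, clause A is added first, then B, and selection stops because adding C would lower the score.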

One might expect that using the same domain as both

source and target would forfeit some of the expected benefits

of transfer, for example the improved generalization perfor-

mance which often results from considering a broader class

of related tasks (Caruana 1997). Surprisingly, this does not

appear to be the case: in our experiments the results of self-

transfer were generally comparable to the best results from

DTM. While there may exist circumstances in which cross-domain transfer through DTM is more effective (e.g. if the target domain has very little data available, or if we wish to learn structure for many domains using the same source domain to avoid repeating the clique-scoring process for each domain), we demonstrate that in many cases the benefits of transfer observed using DTM are achievable through simple self-transfer.

Dataset         Types  Predicates  Constants  True Atoms  Total Atoms
IMDB                5           5        251        1039        55722
UW-CSE              8          10        442        1407       174785
WebKB               3           3       4953      283489     20663916
Yeast Protein       7           7       2470       15015      4533900

Table 1: Datasets used.

This observation strongly suggests that much of the utility

of DTM comes, not from the ability to transfer knowledge

between domains, but from the introduction of a new learn-

ing process, namely the process of clique scoring in con-

junction with first-order instantiation and greedy selection.

In fact, we find that clique scoring with greedy selection

(CSGL) alone yields results equalling or exceeding those of

MSL.

Experiments

To evaluate the effectiveness of self-transfer relative to other

transfer cases, as well as the effectiveness of CSGL as a

standalone structure learning algorithm, we performed ex-

periments using data from several real-world domains. We

used the implementation of MSL in the publicly available

Alchemy package (Kok et al. 2009) along with an Alchemy-

based implementation of DTM provided by Jesse Davis.

Domains

For comparison with the results of Davis and Domingos

(2009), we used datasets representing the WebKB and Yeast

Protein domains. Both datasets were provided by Jesse

Davis.

WebKB. This dataset consists of labeled web pages from

the computer science departments of four universities, with

predicates indicating the words occurring on each page, the

class label of each page (faculty, student, course, etc.), and

the links between pages. The data from each university is

treated as a separate fold. We attempt to predict the truth

values of all groundings of the Linked and PageClass

predicates. The data is originally sourced from Craven and

Slattery (2001), and the version used both in this paper and

in (Davis and Domingos 2009) is equivalent to the version

publicly available at alchemy.cs.washington.edu,

with the following modifications: the seven single-arity

predicates FacultyPage, CoursePage, etc. indicating

the class label of a page are collapsed into a single predicate

PageClass of arity two, and only the Has, Linked, and

PageClass predicates are considered.

Yeast Protein.

This dataset contains information on

protein location, function, phenotype, class, and enzymes

within the yeast Saccharomyces cerevisiae, as well as pro-

tein interactions and protein complex data, all from the


MIPS Comprehensive Yeast Genome Database as of Febru-

ary 2005 (Mewes et al. 2002). We used the version of the

data from Davis and Domingos, which is split into four dis-

joint subsamples which are used as folds, and we attempt to

predict the Interaction and Function predicates.

We also used two additional domains, both publicly avail-

able from alchemy.cs.washington.edu:

IMDB. This dataset describes a movie domain, consisting

of movies, actors, directors, etc. and predicates indicating

their relationships. The data was collected from imdb.com

by Mihalkova and Mooney (2007). The data is split into five

disjoint folds, but in order to maintain consistency with We-

bKB and Yeast, we used only the first four folds. We attempt

to predict the WorkedUnder and WorkedInGenre pred-

icates. Following Kok and Domingos (2009) we omitted

four equality predicates which are superseded by the equal-

ity operator available in Alchemy. In addition, for consis-

tency with WebKB, we collapsed the single-arity Actor

and Director predicates into a single predicate HasRole

with arity two (similar to PageClass in WebKB).

UW-CSE. This dataset, from Richardson and Domingos

(2006), describes anonymized relationships between stu-

dents, faculty, and courses in the University of Washing-

ton Computer Science and Engineering Department. The

data is split into five folds representing the sub-disciplines of

AI, graphics, programming languages, systems, and theory;

again for consistency we used only four folds, omitting the

systems data (chosen randomly). Following Richardson and

Domingos we attempted to predict the AdvisedBy pred-

icate. As with IMDB, we omitted nine redundant equality

predicates, and we collapsed the single-arity Student and

Professor into a single HasRank. We also simplified

the TaughtBy and TA predicates of UW-CSE to ignore the

particular quarter in which a course was taught, reducing

the arity of each of those predicates from three to two. This

change was motivated by the fact that DTM can only trans-

fer between predicates having the same arity; because there

are no arity-three predicates in our other datasets these pred-

icates would otherwise have been completely ignored by the

transfer process.
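The two dataset transformations described above (collapsing single-arity role predicates into one arity-two predicate, and dropping an argument to reduce arity) can be sketched on a list of ground atoms. The tuple representation, the function names, the example constants, and the use of the old predicate name as the new second argument are all assumptions made for illustration; the actual preprocessing scripts are not described in the text.

```python
def collapse_unary(ground_atoms, unary_predicates, new_predicate):
    """Rewrite e.g. Student(Ann) as HasRank(Ann, Student)."""
    collapsed = []
    for predicate, args in ground_atoms:
        if predicate in unary_predicates and len(args) == 1:
            collapsed.append((new_predicate, (args[0], predicate)))
        else:
            collapsed.append((predicate, tuple(args)))
    return collapsed


def drop_argument(ground_atoms, predicate, position):
    """Remove one argument of `predicate`, merging atoms that become equal."""
    seen = set()
    result = []
    for pred, args in ground_atoms:
        if pred == predicate:
            args = args[:position] + args[position + 1:]
        atom = (pred, tuple(args))
        if atom not in seen:  # keep first occurrence, preserve order
            seen.add(atom)
            result.append(atom)
    return result


# Hypothetical ground atoms in the style of UW-CSE.
atoms = [
    ("Student", ("Ann",)),
    ("Professor", ("Bob",)),
    ("TaughtBy", ("C1", "Bob", "Winter")),
    ("TaughtBy", ("C1", "Bob", "Spring")),
]
collapsed = collapse_unary(atoms, {"Student", "Professor"}, "HasRank")
reduced = drop_argument(collapsed, "TaughtBy", 2)
```

After both steps every predicate in the example has arity two, and the two TaughtBy atoms that differed only in the quarter have merged into one.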

Details of all datasets are given in Table 1.

Experiment 1: DTM vs. Self-Transfer

To compare the effectiveness of cross-domain transfer to that

of self-transfer, we used DTM to perform transfer between

all combinations of source and target domains, including

cases of self-transfer. Each dataset was divided into four

independent folds, on which we performed leave-one-out

cross-validation, training on every subset of three folds and

testing on the fourth. The results represent averages over the

four folds from each domain. For tractability on the WebKB

data, we followed Davis and Domingos in using information

gain on the training set to pick the fifty words most predic-

tive of page class; these were used to train and evaluate the

learned MLN.

In cases of self-transfer, we gathered and scored cliques

using only the training set, not the full dataset. Since in all

other transfer cases we gathered and scored cliques using all

data available from the source domain, this puts self-transfer

at a modest disadvantage in terms of the quantity of data

available during the learning process. That said, within each

domain the cliques which were identified in the three folds

of each training set did not differ significantly from those

obtained using the full four folds, with only minor changes

in ordering in most cases.

Following Davis and Domingos, we allowed MSL to

learn clauses containing constants. We permitted only role

(IMDB), rank (UW-CSE), function (Yeast), and page class

(WebKB) to appear as constants in learned clauses. We eval-

uated DTM with k = 5 and k = 10; because the results

show the same overall trends we report only the k = 10

results here. Source clauses for DTM were generated by

exhaustive search over all clauses containing at most three

literals and three object variables. MSL structure refinement

was time-limited to 20 hours for each trial. Following Kok

and Domingos (2005) and others, we evaluated each set of

learned formulas using the test set conditional log-likelihood

(CLL) and the area under the precision-recall curve (AUC).

The CLL has the advantage of directly measuring the qual-

ity of the probability estimates produced, while the AUC is

useful because it is insensitive to the large number of true

negatives in the data. The CLL is calculated by averaging

over all ground atoms the predicted log-likelihood that each

ground atom takes on its true value, given the learned do-

main theory and the truth values of all other ground atoms as

evidence. AUC is calculated by varying the threshold CLL

above which an atom is predicted to be true.
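The two evaluation measures can be sketched as follows, given the predicted probability that each test ground atom is true and its actual truth value. The clipping constant `eps` and the step-wise (average-precision) approximation of the area under the precision-recall curve are assumptions of this sketch; the evaluation code used in the experiments may interpolate the curve differently.

```python
import math


def average_cll(probs_true, truths, eps=1e-6):
    """Mean log-probability assigned to each ground atom's actual value."""
    total = 0.0
    for p, truth in zip(probs_true, truths):
        p = min(max(p, eps), 1 - eps)  # avoid log(0); eps is an assumption
        total += math.log(p if truth else 1 - p)
    return total / len(truths)


def auc_pr(probs_true, truths):
    """Step-wise area under the precision-recall curve.

    Atoms are ranked by predicted probability, which corresponds to sweeping
    the decision threshold from high to low; ties are not treated specially.
    """
    ranked = sorted(zip(probs_true, truths), key=lambda pair: -pair[0])
    positives = sum(truths)
    true_positives = 0
    prev_recall = 0.0
    area = 0.0
    for rank, (_, truth) in enumerate(ranked, start=1):
        if truth:
            true_positives += 1
            recall = true_positives / positives
            precision = true_positives / rank
            area += precision * (recall - prev_recall)
            prev_recall = recall
    return area


# Hypothetical predictions for four ground atoms.
example_probs = [0.9, 0.8, 0.3, 0.1]
example_truths = [True, False, True, False]
example_cll = average_cll(example_probs, example_truths)
example_auc = auc_pr(example_probs, example_truths)
```

Because precision and recall ignore true negatives, `auc_pr` is unaffected by adding more correctly rejected atoms, whereas `average_cll` averages over every atom.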

Experiment 2: CSGL vs. MSL

We measured the performance of clique scoring with greedy

selection as a standalone structure learning algorithm by

comparing its performance to that of MSL. Results for

CSGL on each domain were obtained by omitting the re-

finement step from the self-transfer results of the previous

section, that is, by combining exhaustive search with clique

scoring followed by greedy selection. Evaluation was per-

formed identically to the previous section; we report results

for k = 5 and k = 10.

Results

Table 2 gives the AUC and CLL for all transfer scenarios

using refinement, including self-transfer results (which are

underlined) as well as MSL, which acts as a baseline. Each

figure represents an average over the four different train/test

trials. Note that the results for transfer from WebKB and

Yeast are identical. This is because the ten highest-scoring cliques are the same in both domains (see Figure 4 for a listing of the top cliques in each domain), so deep transfer produces the same theories when transferring from either domain. The top five cliques are likewise identical across WebKB and Yeast.

The results in Table 2 support the claims of Davis and Domingos, in that DTM improves on MSL in almost all cases. Examining self-transfer in particular, we see that self-transfer is at least competitive with other transfer scenarios on all predicates except PageClass (where the deficiency is due to a single outlier trial, in which refinement of the results from greedy selection led to a large decrease in AUC), and that in fact it performs significantly better than all other methods in AUC when predicting the predicate WorkedUnder (paired one-tail t-test, p < 0.05). For no predicate does the best case of cross-domain transfer perform significantly better, in AUC or CLL, than self-transfer does (paired one-tail t-test, p > 0.10). This is consistent with our claim that the performance gains of DTM over MSL are not related to DTM’s incorporation of source-domain knowledge.

(a) AUC

                        Source domain
Predicate       IMDB     UW-CSE   WebKB    Yeast    MSL
WorkedInGenre   _0.63_   0.61     0.61     0.61     0.32
WorkedUnder     _0.77_   0.23     0.21     0.21     0.03
AdvisedBy       0.08     _0.08_   0.08     0.08     0.04
Linked          0.01     0.01     _0.09_   0.09     0.004
PageClass       0.86     0.86     _0.68_   0.68     0.87
Function        0.34     0.34     0.33     _0.33_   0.27
Interaction     0.04     0.04     0.10     _0.10_   0.04

(b) CLL

                        Source domain
Predicate       IMDB     UW-CSE   WebKB    Yeast    MSL
WorkedInGenre   _-0.20_  -0.13    -0.37    -0.37    -0.30
WorkedUnder     _-0.09_  -0.17    -0.21    -0.21    -0.23
AdvisedBy       -0.03    _-0.03_  -0.03    -0.03    -0.04
Linked          -0.02    -0.02    _-0.02_  -0.02    -0.02
PageClass       -0.07    -0.07    _-0.12_  -0.12    -0.07
Function        -0.18    -0.18    -0.18    _-0.18_  -0.19
Interaction     -0.04    -0.04    -0.03    _-0.03_  -0.04

Table 2: Results for DTM vs. self-transfer (underlined, shown as _value_).

(a) AUC

Predicate       CSGL-5   CSGL-10  MSL
WorkedInGenre   0.70     0.63     0.32
WorkedUnder     0.26     0.69     0.03
AdvisedBy       0.04     0.06     0.04
Linked          0.06     0.06     0.004
PageClass       0.86     0.86     0.87
Function        0.31     0.31     0.27
Interaction     0.10     0.10     0.04

(b) CLL

Predicate       CSGL-5   CSGL-10  MSL
WorkedInGenre   -0.16    -0.15    -0.30
WorkedUnder     -0.14    -0.11    -0.23
AdvisedBy       -0.04    -0.03    -0.04
Linked          -0.02    -0.02    -0.02
PageClass       -0.07    -0.07    -0.07
Function        -0.17    -0.17    -0.19
Interaction     -0.03    -0.03    -0.04

Table 3: Results for CSGL (top 5 and top 10 cliques) vs. MSL.

Note that self-transfer generally matches or outperforms

other transfer settings despite the limitation of having only

the three folds of the training set from which to identify

the top cliques, as opposed to using the full four folds of

source domain data which are available to the other transfer

scenarios. Also recall that we modified the logical struc-

ture of each dataset so that all of its predicates had arity

two, thus giving DTM the greatest possible freedom to transfer structure between all predicates. If we had allowed the

single-arity Student and Professor predicates to re-

main uncollapsed in the UW-CSE dataset, for example, then

DTM would have been unable to relate them to the analo-

gous PageClass predicate in WebKB because it has arity

two. By contrast, self-transfer generates cliques with arities

appropriate to the predicates of each dataset.

Table 3 compares CSGL to MSL as standalone structure

learning algorithms, with two versions of CSGL instantiat-

ing the top five and ten highest-scoring cliques respectively.

Results from CSGL-5 and CSGL-10 were generally comparable, although CSGL-10 fared much better when predicting WorkedUnder. Note that CSGL-10 beats MSL in every case except for the PageClass predicate of WebKB, for which the two methods give approximately equal results. CSGL-10 also performs comparably to 10-clique self-transfer in most cases, and substantially better in the case of PageClass, indicating that the additional, costly refinement step required by self-transfer may not be necessary in order to achieve satisfactory results.

Related Work

Several structure learning algorithms have been proposed for

Markov logic networks, but of particular relevance here is

LHL, the hypergraph lifting approach described by Kok and

Domingos (2009). This is because the unlifted variant of

LHL, known as LHL-FindPaths, bears remarkable similari-

ties to CSGL in its general structure. Like LHL-FindPaths,

CSGL is a bottom-up structure learner which constructs

clauses directly from the data. The exhaustive search step

used by CSGL to generate initial clauses is equivalent to the

process in LHL-FindPaths of enumerating and variabilizing

paths in the unlifted hypergraph, except for the added restriction that every conjunction which LHL-FindPaths considers

must have at least one support in the data. Both methods

evaluate clauses according to how well they represent structural regularities not found in their sub-clauses; in CSGL this

is implemented by the clique-scoring process in which all

but the top-scoring cliques are discarded, while in LHL this

is done by simply discarding any clause having a WPLL less

than one of its subclauses. Both methods consider as candi-

dates many combinations of negated and non-negated atoms

in the clauses that they generate; in CSGL this is part of the

clique abstraction and instantiation process, while LHL ex-

plicitly constructs partially-negated variants of its clauses.

Finally, both methods arrive at the final MLN structure by

greedily selecting clauses from a list of candidates until no