Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing

Grzegorz Chrupała

A dissertation submitted in fulfilment of the requirements for the award of

Doctor of Philosophy (Ph.D.)

to the

Dublin City University

School of Computing

Supervisor: Prof. Josef van Genabith

April 2008

Declaration

I hereby certify that this material, which I now submit for assessment on the programme

of study leading to the award of Doctor of Philosophy (Ph.D.), is entirely my own work,

that I have exercised reasonable care to ensure that the work is original and does not,

to the best of my knowledge, breach any law of copyright, and that it has not been

taken from the work of others save to the extent that such work has been cited

and acknowledged within the text of my work.

Signed

(Grzegorz Chrupała)

Student ID 55130089

Date April 2008


Contents

1 Introduction 1

1.1 Shallow vs Deep Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Deep Data-Driven Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Multilingual Treebank-Based LFG . . . . . . . . . . . . . . . . . . . . . 3

1.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.5 The Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.6 Summary of Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Treebank-Based Lexical Functional Grammar Parsing 9

2.1 Lexical Functional Grammar . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 LFG parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.1 Treebank-based LFG parsing . . . . . . . . . . . . . . . . . . . . 15

2.3 GramLab – Treebank-Based Acquisition of Wide-Coverage LFG Resources 20

3 Machine Learning 22

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.2 Feature representation . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.2 K-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2.3 Logistic Regression and MaxEnt . . . . . . . . . . . . . . . . . . 30

3.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . 36

3.3 Sequence Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


3.3.1 Maximum Entropy Markov Models . . . . . . . . . . . . . . . . . 43

3.3.2 Conditional Random Fields and other structured prediction methods . . . 44

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 Treebank-Based LFG Parsing Resources for Spanish 47

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1.1 The Cast3LB Spanish treebank . . . . . . . . . . . . . . . . . . . 47

4.2 Comparison to Previous Work . . . . . . . . . . . . . . . . . . . . . . . 48

4.3 Improving Spanish LFG Resources . . . . . . . . . . . . . . . . . . . . . 52

4.3.1 Clitic doubling and null subjects . . . . . . . . . . . . . . . . . . 52

4.3.2 Periphrastic constructions . . . . . . . . . . . . . . . . . . . . . . 55

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5 Learning Function Labels 60

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.2 Learning Cast3LB Function Labels . . . . . . . . . . . . . . . . . . . . . 61

5.2.1 Annotation algorithm . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2.2 Previous work on learning function labels . . . . . . . . . . . . . 64

5.2.3 Assigning Cast3LB function labels to parsed Spanish text . . . . 64

5.2.4 Cast3LB function label assignment evaluation . . . . . . . . . . . 69

5.2.5 Task-based LFG annotation evaluation . . . . . . . . . . . . . . . 72

5.2.6 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.7 Adapting to the AnCora-ESP corpus . . . . . . . . . . . . . . . . 75

5.3 Improving Training for Function Labeling by Using Parser Output . . . 78

5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.3.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 Learning Morphology and Lemmatization 94

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.1.1 Main results obtained . . . . . . . . . . . . . . . . . . . . . . . . 94


6.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.2.1 Inductive Logic Programming . . . . . . . . . . . . . . . . . . . . 95

6.2.2 Memory-based learning . . . . . . . . . . . . . . . . . . . . . . . 99

6.2.3 Analogical learning . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.2.4 Morphological tagging and disambiguation . . . . . . . . . . . . 103

6.3 Simple Data-Driven Context-Sensitive Lemmatization . . . . . . . . . . 104

6.3.1 Lemmatization as a classification task . . . . . . . . . . . . . . . 104

6.3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.3.3 Evaluation results and error analysis . . . . . . . . . . . . . . . . 109

6.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.4 Morfette – a Combined Probabilistic Model for Morphological Tagging and Lemmatization . . . 114

6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

6.4.2 The Morfette system . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.4.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.4.4 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

6.4.5 Integrating lexicons . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.4.6 Improving lemma class discovery . . . . . . . . . . . . . . . . . . 130

6.4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.5 Morphological Analysis and Synthesis: ILP and Classifier-Based Approaches . . . 135

6.5.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

6.5.2 Model and features . . . . . . . . . . . . . . . . . . . . . . . . . . 137

6.5.3 Results and error analysis . . . . . . . . . . . . . . . . . . . . . . 138

6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

7 Conclusion 142

7.1 Summary of Main Contributions . . . . . . . . . . . . . . . . . . . . . . 142

7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.1 Grammatical functions . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.2 Morphology and Morfette . . . . . . . . . . . . . . . . . . . . . . 145

7.2.3 Other aspects of LFG parsing . . . . . . . . . . . . . . . . . . . . 147


List of Figures

2.1 LFG representation of But stocks kept falling . . . . . . . . . . . . . . . 12

2.2 Pipeline LFG parsing architecture . . . . . . . . . . . . . . . . . . . . . 18

3.1 Averaged Perceptron algorithm . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Example separating hyperplanes in two dimensions . . . . . . . . . . . . 26

3.3 Separating hyperplane and support vectors . . . . . . . . . . . . . . . . 37

3.4 Two-dimensional classification example, non-separable in two dimensions, becomes separable when mapped to three dimensions by (x1, x2) ↦ (x1², 2x1x2, x2²) . . . 40

4.1 Top: flat structure of S; Cast3LB function labels are shown in bold. Bottom: the corresponding (simplified) LFG f-structure. Translation: Let the reader not expect a definition. . . . 49

4.2 Comparison of f-structure representations for NPs . . . . . . . . . . . . 50

4.3 Comparison of f-structure representations for copular verbs . . . . . . . 51

4.4 Periphrastic construction with two light verbs: the treebank tree, and the f-structure produced . . . 57

4.5 Treatment of periphrastic constructions by means of functional uncertainty equations with off-path constraints . . . 58

5.1 Examples of features extracted from an example node . . . . . . . . . . 68

5.2 Learning curves for TiMBL (t), MaxEnt (m) and SVM (s) . . . . . . . . 69

5.3 Subject – Direct Object ambiguity in a Spanish relative clause . . . . . 74

5.4 Algorithm for extracting training instances from a parser tree T and gold tree T′ . . . 84

5.5 Example gold and parser tree . . . . . . . . . . . . . . . . . . . . . . . . 85

6.1 Instance for task 2 in Stroppa and Yvon (2005) . . . . . . . . . . . . . . 101

6.2 Features extracted for the MSD-tagging model from an example Romanian phrase: În pereții boxei erau trei orificii . . . 118

6.3 Background predicate mate/6 . . . . . . . . . . . . . . . . . . . . . . . . 138


List of Tables

2.1 LFG Grammatical functions . . . . . . . . . . . . . . . . . . . . . . . . 10

5.1 Features included in POS tags. Type refers to subcategories of parts of speech, such as common and proper for nouns, or main, auxiliary and semiauxiliary for verbs. For details see (Civit, 2000). . . . 65

5.2 C-structure parsing performance . . . . . . . . . . . . . . . . . . . . . . 66

5.3 Cast3LB function labeling performance for gold-standard trees (Node Span) . . . 70

5.4 Cast3LB function labeling performance for parser output (Node Span: correctly parsed constituents) . . . 71

5.5 Cast3LB function labeling performance for parser output (Headword) . 71

5.6 Statistical significance testing results for Cast3LB tag assignment on parser output . . . 72

5.7 LFG F-structure evaluation results (preds-only) for parser output . . . . 72

5.8 Simplified confusion matrix for SVM on test-set gold-standard trees. The gold-standard Cast3LB function labels are shown in the first row, the predicted tags in the first column; e.g. suj was mistagged as cd in 26 cases. Low-frequency function labels, as well as those rarely mispredicted, have been omitted for clarity. . . . 74

5.9 C-structure parsing performance for Cast3LB . . . . . . . . . . . . . . . 76

5.10 C-structure parsing performance for AnCora . . . . . . . . . . . . . . . 77

5.11 Cast3LB function labeling performance for parser output (Node Span: correctly parsed constituents) . . . 77


5.12 AnCora function labeling performance for parser output for correctly parsed constituents . . . 77

5.13 LFG F-structure evaluation results (preds-only) for parser output for Cast3LB . . . 78

5.14 LFG F-structure evaluation results (preds-only) for parser output for AnCora . . . 78

5.15 Function labels in the English and Chinese Penn Treebanks . . . . . . . 80

5.16 Instance counts and instance overlap against test for the English Penn Treebank training set . . . 86

5.17 Mean Hamming distance scores for the English Penn Treebank training set . . . 87

5.18 Function labeling evaluation on parser output for WSJ section 23 – Labeled Node Span . . . 89

5.19 Function labeling evaluation on parser output for WSJ section 23 – Headword . . . 89

5.20 Per-tag performance of baseline and when training on reparsed trees – Labeled Node Span . . . 90

5.21 Function labeling evaluation for the CTB on the parser output for the development set . . . 91

5.22 Function labeling evaluation for the CTB on the parser output for the test set . . . 91

6.1 Morphological synthesis and analysis performance in (Manandhar et al., 1998) . . . 98

6.2 Results for task 1 in Stroppa and Yvon (2005) . . . . . . . . . . . . . . 102

6.3 Results for task 2 in Stroppa and Yvon (2005) . . . . . . . . . . . . . . 102

6.4 Feature notation and description for lemmatization . . . . . . . . . . . . 107

6.5 Example features for lemmatization extracted from a Spanish sentence . 108

6.6 Lemmatization evaluation for eight languages . . . . . . . . . . . . . . . 109

6.7 Lemmatization evaluation for eight languages – unseen word forms only 110

6.8 Comparison of reverse-edit-list+SVM to Freeling on the lemmatization task for Spanish . . . 111


6.9 Comparison of reverse-edit-list+SVM to Freeling on the lemmatization task for Catalan . . . 112

6.10 Statistical significance test . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.11 Feature notation and description for the basic configuration . . . . . . . 117

6.12 Evaluation results for the basic model with the small training set for Spanish, Romanian and Polish . . . 121

6.13 Evaluation results with a full training set for Spanish and Polish. Numbers in brackets indicate accuracy improvement over the same model trained on the small training set . . . 121

6.14 Evaluation results of the basic+dict model with the small training set with lexicons of various sizes for Spanish. Numbers in brackets indicate accuracy improvement over the basic model with the same training set . . . 127

6.15 Evaluation results of the basic+dict model with the full training set with lexicons of various sizes for Spanish. Numbers in brackets indicate accuracy improvement over the basic model with the same training set . . . 127

6.16 Evaluation results for Freeling with two different dictionaries . . . . . . 129

6.17 Evaluation results for Morfette in two configurations. The numbers in brackets indicate improvement over Freeling with dict-large . . . 129

6.18 Results for the basic feature set on the small training set, using the edit-tree as lemma class for Polish. Numbers in brackets indicate improvement over the same configuration with reverse-edit-list . . . 133

6.19 Results for the basic feature set, using the edit-tree as lemma class for Welsh and Irish. Numbers in brackets indicate improvement over the same configuration with reverse-edit-list . . . 134

6.20 Features for the lexical analysis model . . . . . . . . . . . . . . . . . . . 137

6.21 Features for the lexical synthesis model . . . . . . . . . . . . . . . . . . 137

6.22 Morphological analysis results – all . . . . . . . . . . . . . . . . . . . . . 139

6.23 Morphological synthesis results – all . . . . . . . . . . . . . . . . . . . . 139

6.24 Morphological analysis results – seen . . . . . . . . . . . . . . . . . . . . 139

6.25 Morphological analysis results – unseen . . . . . . . . . . . . . . . . . . 139

6.26 Morphological synthesis results – seen . . . . . . . . . . . . . . . . . . . 139


6.27 Morphological synthesis results – unseen . . . . . . . . . . . . . . . . . 140


Acknowledgments

The work I carried out during the three years of my PhD at DCU would not have been

possible without the support of many colleagues and friends. First, I’d like to say many

thanks to Josef van Genabith who was an enthusiastic supervisor, always interested in

my ideas and ready to suggest new ones whenever I got stuck. Josef’s endless optimism

and positive attitude were a most welcome antidote to my doubts and skepticism.

There are two people who helped shape my thinking and my work in multiple ways:

my co-authors and friends Nicolas Stroppa and Georgiana Dinu. I am grateful to Nico

for sharing with me his expertise in both the technical details of, and the guiding con-

cepts behind Machine Learning during innumerable coffee breaks. Georgiana served as

a tireless sounding board: I would have never been able to fully flesh out my ideas with-

out constantly sharing them with her and hearing what she thought. Georgiana also

read parts of the thesis and helped remove many mistakes and unclear points. I would

also like to thank both Nicolas and Georgiana for the effort they put in collaborating

with me on joint papers: it was a pleasure to work with you.

I would also like to thank the co-members of the GramLab project: Ines Rehbein,

Yuqing Guo, Masanori Oya and Natalie Schluter, as well as other researchers at NCLT:

Joachim Wagner, Yvette Graham and Jennifer Foster. Thanks for talking to me, going

through the routine of weekly meetings together, listening and giving suggestions at

seminar talks and dry-runs! Other researchers whom I would like to thank for their helpful

suggestions and/or generally inspiring conversations are Aoife Cahill, John Tinsley and

Augusto Jun Devegili.

Özlem Çetinoğlu helped to make this thesis better by always being ready to listen

to me and offer advice. She also proofread parts of the text and helped to clarify it.

A special round of thanks goes to two of my friends and colleagues at DCU: Bart


Mellebeek and Djamé Seddah. They were great colleagues, always ready to listen and

help out with research questions. They are also my best friends, and doing a PhD in

Dublin would be a less rewarding and duller experience without all the great times we

had together: thanks guys!

I’d like to say a big thank you to Eva Martínez Fuentes, who put up with, and even

shared and enjoyed the bizarre interests and social life of a PhD student. Thank you

for your support and friendship.

The final few months of a PhD program are a notoriously difficult time: they were

made much more enjoyable by the endless stimulating chats about science, life and

everything with Anke Dietzsch. There is nothing better to renew one’s energies than

the company of a smart biologist: thank you Anke.

Finally, I would like to express my gratitude to Science Foundation Ireland, which

supported my research with grant 04/IN/I527.


Abstract

Data-driven grammar induction aims at producing wide-coverage grammars of human

languages. Initial efforts in this field produced relatively shallow linguistic representa-

tions such as phrase-structure trees, which only encode constituent structure. Recent

work on inducing deep grammars from treebanks addresses this shortcoming by also

recovering non-local dependencies and grammatical relations. My aim is to investigate

the issues arising when adapting an existing Lexical Functional Grammar (LFG) induc-

tion method to a new language and treebank, and find solutions which will generalize

robustly across multiple languages.

The research hypothesis is that by exploiting machine-learning algorithms to learn

morphological features, lemmatization classes and grammatical functions from tree-

banks we can reduce the amount of manual specification and improve robustness, ac-

curacy and domain- and language-independence for LFG parsing systems.

Function labels can often be relatively straightforwardly mapped to LFG grammat-

ical functions. Learning them reliably permits grammar induction to depend less on

language-specific LFG annotation rules. I therefore propose ways to improve acquisition

of function labels from treebanks and translate those improvements into better-quality

f-structure parsing.

In a lexicalized grammatical formalism such as LFG a large amount of syntactically

relevant information comes from lexical entries. It is, therefore, important to be able

to perform morphological analysis in an accurate and robust way for morphologically

rich languages. I propose a fully data-driven supervised method to simultaneously

lemmatize and morphologically analyze text and obtain competitive or improved results

on a range of typologically diverse languages.


Chapter 1

Introduction

Natural Language Processing (NLP) seeks to develop methods which make it possible

for computers to deal with human language texts in a meaningful and useful fashion.

Unstructured textual information written by and for humans is ubiquitous and being

able to make sense of it in an automated fashion is highly desirable. Many NLP

applications can benefit if they are able to automatically associate syntactic and/or

semantic structure with natural language text, i.e. to parse it.

1.1 Shallow vs Deep Parsing

Traditionally, approaches to parsing within NLP fell into two types. First, parsing

can be performed by having expert linguists develop a computational grammar for a

given language, which can then be used by a parsing engine to assign a set of analyses

to a sentence. Typically, such a grammar would be based on some sufficiently formal

and explicit theory of language syntax and semantics, and would provide linguistically

well-motivated and rich representations of syntactic structure.

Second, grammars, or more generally parsing models, can be extracted automati-

cally from a large corpus annotated by expert linguists (a treebank). Typically such

a grammar would tend to be relatively simple and relatively theory-neutral, and would

provide rather shallow syntactic representations.¹

¹ In this context, by “shallow parsing” I mean finding a basic constituent structure for a sentence. I do not mean partial parsing, or chunking, where only a simple flat segmentation is imposed on the sentence.

However, it would have access to


frequency counts of different structures in the training corpus, which can be used for

managing ambiguities pervasive in natural language syntax.

1.2 Deep Data-Driven Parsing

In more recent years significant effort has been put into overcoming this dichotomy and

superseding the tradeoffs it imposes. A number of systems have been developed which

combine the use of linguistically sophisticated, rich models of syntax and semantics with

the data-driven methodology informed by probability theory and machine learning.

Such “deep data-driven parsing” approaches combine the best of both worlds: they

offer wide-coverage and robustness coupled with linguistic accuracy and depth. The

developments in this area come in a few flavors.

First, shallow probabilistic models have been “deepened”. Many of the complexities

which make natural language syntax difficult, such as long-distance dependencies, were

ignored in shallow approaches early on; however, this need not be the case: treatment

of wh-extraction was incorporated into Model 3 of the Collins parser (Collins, 1997).

Second, many ways have been found to “enrich” the output of shallow parsers with

extra information. Examples include adding function labels (to be discussed in Chapter

5) or resolving long-distance dependencies, e.g.: (Johnson, 2001; Levy and Manning,

2004; Campbell, 2004; Gabbard et al., 2006).

Third, parsers using hand-written grammars have been equipped with probabilistic

disambiguation models trained on annotated corpora (Riezler et al., 2001; Kaplan et al.,

2004; Briscoe and Carroll, 2006). This does not solve the problem of limited coverage

those grammars have, but does provide a principled way to rank alternative analyses.

Limited coverage has been addressed in these systems by implementing robustness

heuristics such as combining partial parses as described by Kaplan et al. (2004).

Finally, standard annotated corpora have been used to train data-driven parsers for

deep linguistic formalisms such as Tree Adjoining Grammar (Xia, 1999), Lexical Func-

tional Grammar (Cahill et al., 2002, 2004), Head-driven Phrase Structure Grammar

(Miyao et al., 2003; Miyao and Tsujii, 2005) and Combinatory Categorial Grammar

(Clark and Hockenmaier, 2002; Clark and Curran, 2004).


1.3 Multilingual Treebank-Based LFG

The research described in this thesis was carried out in the context of the GramLab

project which aims to develop resources for wide-coverage multilingual Lexical Func-

tional Grammar parsing.

Initial work on data-driven LFG parsing for English was done by Cahill et al. (2002,

2004) at Dublin City University (DCU). LFG has two parallel syntactic representa-

tions: constituency trees (c-structures) and representations of dependency relations

(f-structures). The DCU approach develops an LFG annotation algorithm which adds

information about LFG grammatical functions and other attributes to English Penn II

treebank-style trees. These annotations can be used to build LFG-style representations

of dependency relations (f-structures). The approach builds LFG representations in

two steps: c-structures are constructed by a probabilistic parsing model trained on a

treebank, then the trees are automatically annotated and the f-structures are built. It

has been demonstrated that this method can successfully compete with parsing sys-

tems which use large hand-written grammars developed over many years, on their own

evaluation data (Burke et al., 2004a; Cahill et al., 2008).
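The two-step architecture described above can be sketched as follows. This is a hedged, minimal illustration: the function names, the mock tree, and the annotation rules are stand-ins invented for this example, not the actual DCU implementation, which uses a treebank-trained probabilistic parser and a far richer annotation algorithm and constraint solver.

```python
# Hedged sketch of the two-step treebank-based LFG pipeline:
# (1) parse to a c-structure, (2) annotate with functional equations
# and resolve them into an f-structure. All components are toy stand-ins.

def parse_c_structure(sentence):
    """Step 1: a probabilistic parser would return a constituency tree;
    here we build a trivial mock tree for illustration."""
    return ("S", [("NP", sentence[:1]), ("VP", sentence[1:])])

def annotate(tree):
    """Step 2a: the annotation algorithm decorates tree nodes with
    LFG functional equations such as (up SUBJ)=down."""
    label, children = tree
    eqs = {"NP": "(up SUBJ)=down", "VP": "up=down"}
    return (label, [(eqs.get(c[0], "up=down"), c) for c in children])

def build_f_structure(annotated):
    """Step 2b: resolving the equations yields an attribute-value
    matrix (f-structure); here simply collected into a dict."""
    _, children = annotated
    fstr = {}
    for eq, (label, words) in children:
        if "SUBJ" in eq:
            fstr["SUBJ"] = {"PRED": words[0]}
        else:
            fstr["PRED"] = words[0]
    return fstr

sentence = ["stocks", "kept", "falling"]
f = build_f_structure(annotate(parse_c_structure(sentence)))
print(f)  # {'SUBJ': {'PRED': 'stocks'}, 'PRED': 'kept'}
```

The point of the sketch is the division of labour: the c-structure step is purely data-driven, while the annotation step encodes (or, in this thesis, increasingly learns) the mapping from configurations to grammatical functions.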

This empirical success provided the motivation for adapting the approach to other

languages. Appropriate training resources, i.e. large, syntactically annotated treebanks

are now available for many languages. However, the challenge of multilinguality is

not only the availability of resources but also the variation across human languages.

Languages differ along a number of dimensions, and often trade off complexity in one

linguistic subsystem for simplicity in another.

Computational language processing follows the standard scientific practice of re-

ductionism, and adopts simplifying assumptions about its object of study that may in

general be untrue but enable incremental progress to be made. Such simplifications are

often unstated and may be difficult to identify until our methods are stress-tested on

diverse data. And multilingual processing is one scenario where our assumptions may

need to be revised.

One aspect of the research described in this thesis is adapting the DCU treebank-

based LFG parsing architecture to the Spanish Cast3LB treebank. This exercise, as

well as work on other languages by members of the GramLab project, illuminated


a number of linguistic divergences relevant for processing. The two most important

divergences between English and a language such as Spanish are along the dimensions

of configurationality and morphological richness.

While English has highly constrained constituent order, and grammatical function

of constituents is largely determined by syntactic configuration, in Spanish the order

of main sentence constituents is governed by soft preferences depending on multiple

factors, and grammatical function is less predictable from configuration.

The syntactic rigidity of English goes hand in hand with little inflectional morphol-

ogy. Spanish is morphologically much richer than English (although of course Spanish

morphology is still quite limited compared to Slavic languages or to Arabic).

The syntactic flexibility of a language like Spanish makes it problematic to rely

heavily on a hand-written annotation algorithm which attempts to assign LFG gram-

matical function annotations to constituents in a parse tree. What is needed is a method

which draws information from many sources, such as local configuration, word order,

morphological features, lexical items, semantic features (e.g. animacy) and combines

the evidence to arrive at the final decision.

Rich morphology makes it necessary to use a step of morphological analysis more

complex than simple Part-of-Speech (POS) tagging prior to syntactic analysis. Ac-

curate morphological analysis is important for a deep lexicalized formalism like LFG

where morphological features such as agreement and case are used to constrain possible

syntactic analyses, and where normalized, lemmatized forms of lexical items are used

to build dependency relations.
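As a minimal illustration of what such an analysis step produces, the toy code below maps an inflected Spanish form to a lemma and a morphosyntactic description via suffix rewriting. The rules and tag names are invented for this example; the systems developed later in the thesis learn such mappings from data rather than listing them by hand.

```python
# Toy illustration: map inflected Spanish forms to (lemma, MSD tag).
# The suffix rules and tag inventory are invented for this example only.

RULES = [
    # (suffix of the word form, replacement yielding the lemma, MSD tag)
    ("aban", "ar", "verb.past.imperfect.3pl"),
    ("as",   "a",  "noun.fem.pl"),
    ("os",   "o",  "noun.masc.pl"),
]

def analyze(form):
    """Return (lemma, morphosyntactic description) for a word form."""
    for suffix, repl, msd in RULES:
        if form.endswith(suffix):
            return form[: -len(suffix)] + repl, msd
    return form, "unknown"

print(analyze("cantaban"))  # ('cantar', 'verb.past.imperfect.3pl')
print(analyze("casas"))     # ('casa', 'noun.fem.pl')
```

Even this caricature shows why context matters: a bare suffix rule cannot decide between genuinely ambiguous forms, which is what motivates the context-sensitive, learned models of Chapter 6.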

Obviously we would like to learn to perform those two tasks, namely assigning

grammatical functions to nodes in parse trees and assigning morphological features and

lemmas to words in context, from training data for a particular language. Treebanks are

annotated with information which can be exploited to learn those tasks: they typically

enrich phrase-structure annotation with some grammatical function labels and some

semantic role labels. They are also typically morphologically analyzed and lemmatized

(and additionally there are other morphologically analyzed corpora that can be used

for training).

The driving idea in this thesis is to improve data-driven LFG parsing by making it


more data-driven: learn more, and hardcode less. Learning to reliably assign function

labels from training data shifts the weight away from a hand-written LFG annotation

algorithm. For a language like Spanish, an annotation algorithm without access to

accurate function labels would work very poorly: in this case learning from data is

a necessity rather than just an improvement. Similarly, for languages with pervasive

inflectional phenomena, accurate and complete morphological analysis is a must. Even

though this can be, and has been, achieved by hand-writing finite-state analysers, here I

will adhere to the data-driven approach and determine how much, and how well, these

tasks can be learned from annotated data.

1.4 Machine Learning

Machine Learning (ML) is the solution to many of the issues outlined in the previous

section: supervised learning methods allow us to find in our training data correlations

which can be exploited for predicting the phenomena we are interested in, such as a

constituent’s grammatical function, or the morphological features of a word in context.

We extract such hints, or features, from the data, and learn how much and in what way

they contribute to the final prediction; in other words we learn the model parameters.

When we apply the learned model to new data, we obtain a prediction, possibly with

an associated probability or other score indicating how confident we can be in it. This

gives us a well-motivated means of predicting combinations of outcomes, such as

sequences of morphological labels, using standard techniques from probability theory.

The most explored setting within supervised machine learning is classification,

where the task is to use a collection of labeled training examples in order to learn

a function which can predict labels for new, unseen examples. Despite its simplicity

this paradigm is remarkably versatile and can be applied to a wide variety of prob-

lems. It can also be extended to learn functions with more complex codomains, such

as sequences of labels.
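The classification setting can be made concrete with a minimal learner. The perceptron below (one of the algorithms reviewed in Chapter 3) is a generic textbook sketch on an invented toy dataset, not the implementation used in the thesis experiments.

```python
# Minimal binary perceptron: learn a linear function from labeled
# feature vectors, then predict labels for new, unseen examples.

def train_perceptron(data, epochs=10):
    """data: list of (feature_vector, label) with label in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified: nudge weights towards y
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# Tiny linearly separable training set; each vector is [bias, feature].
data = [([1, 2], 1), ([1, 3], 1), ([1, -1], -1), ([1, -2], -1)]
w = train_perceptron(data)
print(predict(w, [1, 4]))   # 1
print(predict(w, [1, -3]))  # -1
```

The same scheme carries over to the linguistic tasks in this thesis once the feature vectors encode properties of a tree node or of a word in context rather than toy numbers.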

The ML algorithms used in this thesis fall into the class of discriminative methods,

which model the dependence of the unobserved variable y (the output) on the

observed variable x (the input); in probabilistic terms they describe the conditional


probability distribution p(y|x), rather than the joint distribution p(x,y) used by gen-

erative models. Discriminative approaches allow us to define rich, fine-grained descrip-

tions of the input objects in terms of arbitrary, possibly non-independent features. This

makes discriminative modeling flexible and empirically successful in countless domains,

including many NLP applications.

In the research described here I use machine learning techniques for classification

and for sequence labeling to enhance the two crucial aspects of data-driven LFG parsing

discussed in the previous section: function labeling, and morphological analysis.

1.5 The Structure of the Thesis

The presentation of my research is organized as follows:

Chapter 2 gives a brief introduction to the aspects of Lexical Functional Grammar most relevant to parsing natural language, and proceeds to give an overview of existing work on data-driven treebank-based LFG parsing.

Chapter 3 is a high-level overview of the main aspects of supervised machine learning. I describe feature vector representations, and introduce several commonly used learning algorithms, starting with the Perceptron and continuing with k-NN, Maximum Entropy and Support Vector Machines. Finally I briefly discuss approaches to sequence labeling.

Chapter 5 presents my work on learning models for assigning function labels to parser output. I start by giving a summary of my work on adapting the LFG parsing architecture to Spanish, which was the main motivation for developing a classifier-based function labeler. In Section 5.2 I then describe experiments with three ML methods on the Spanish Cast3LB treebank, report the evaluation results and error analysis; I also briefly describe experiments on the more recent AnCora Spanish treebank. In Section 5.3 I describe an improved method of learning a function labeling model by making use of parser output rather than original treebank trees for training, and report evaluation results using such a model on English and Chinese.


Chapter 6 deals with the task of learning morphological analysis models for languages with rich inflectional morphology. I start by reviewing existing research on supervised learning of morphology. I discuss in some detail approaches based on Inductive Logic Programming (ILP) and Analogical Learning (AL), as well as a number of other methods. I introduce a classifier-based method to learn lemmatization models by using edit scripts between form-lemma pairs as class labels to be learned. I report on experiments using this method on data from six languages. I proceed to introduce the Morfette system, which uses the Maximum Entropy approach to learn a morphological tagging model and a lemmatization model and combines their predictions to assign a sequence of morphological tags and lemmas to sentences. I report on experiments using this system on Spanish, Romanian and Polish. Finally, I compare the performance of the classifier-based method for morphological analysis and synthesis with an ILP implementation, Clog, on data from the Multext-East corpus.

Chapter 7 summarizes the main contributions of this thesis and discusses ideas for refining and extending the research described in the preceding chapters.

1.6 Summary of Main Results

The main results described in Chapters 5 and 6 are the following:

Spanish treebank-based LFG parsing

• I have overhauled and substantially extended the range of phenomena treated in

the Spanish annotation algorithm. I also revised and extended the gold standard

which now includes 338 f-structures. This served two purposes: to identify areas

where the existing LFG parsing architecture for English needed further work to

make it less language dependent and more portable, and to enable the work on

developing and evaluating a function labeling model for Spanish.

Function labeling

• I have developed a function labeler for Spanish which achieves a relative error reduction of 26.73% over the previous method of using the c-structure


parser to obtain function-labeled trees. The use of this model in the LFG parsing

pipeline also improves the f-structure quality as compared to the baseline method.

• I have described a training regime for an SVM-based function labeling model

where trees output by a parser are used in combination with treebank trees in

order to achieve better similarity between training and test examples. This model

outperforms all previously described function labelers on the standard English

Penn II treebank test set (22.73% relative error reduction over previous highest

score).

Morphological analysis

• I have developed a method to cast lemmatization as a sequence labeling task.

It relies on the notion of edit script which encodes the transformations needed

to perform on the word form to convert it into the corresponding lemma. A

lemmatization model can be learned from a corpus annotated only with lemmas,

with no explicit part-of-speech information.

• I have built the Morfette system which performs morphological analysis by learn-

ing a morphological tagging model and a lemmatization model, and combines the

predictions of those two models to find a globally good sequence of MSD-lemma

pairs for a sentence.

• I have shown that integrating information from morphological dictionaries into

the Maximum Entropy models used by Morfette is straightforward and can

substantially reduce error, especially on words absent from the training corpus.

• I have developed an instantiation of the edit script, the Edit Tree, which im-

proves lemmatization class induction in the case where inflectional morphology

affects word beginnings in addition to word endings, and have shown that the

use of this edit script version results in statistically significant error reductions

on test data in Polish, Welsh and Irish.

• I compared the proposed morphology models against existing systems (Freeling and Clog): in both cases my models showed superior or competitive performance.
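The edit-script idea behind these lemmatization results can be illustrated with a minimal sketch. This is my own simplification handling only word endings; the Edit Tree variant mentioned above also covers changes at word beginnings:

```python
def edit_script(form, lemma):
    """Encode the transformation turning a word form into its lemma.

    Based on the longest common prefix: the script is (number of trailing
    characters of the form to drop, string to append).  The script, not the
    lemma itself, is the class label to be learned.
    """
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_script(form, script):
    """Apply a learned script to a (possibly unseen) word form."""
    drop, append = script
    return form[:len(form) - drop] + append

# Many form-lemma pairs share one script, which is what makes scripts
# usable as class labels for a standard classifier:
assert edit_script("walking", "walk") == (3, "")
assert edit_script("walking", "walk") == edit_script("talking", "talk")
assert apply_script("singing", edit_script("walking", "walk")) == "sing"
```

Because the scripts generalize across words, a model trained on lemma-annotated text alone can lemmatize forms it has never seen.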


Chapter 2

Treebank-Based Lexical

Functional Grammar Parsing

In this chapter I provide an overview of Lexical Functional Grammar (LFG) and discuss approaches to parsing natural language within the LFG framework. I will concentrate on the aspects of LFG most relevant to computational implementations.

2.1 Lexical Functional Grammar

Lexical Functional Grammar is a formal theory of language introduced by Bresnan

and Kaplan (1982) and further described in (Bresnan, 2001; Dalrymple, 2001). The

main focus of theoretical linguistics research within LFG has been syntax. LFG syntax

consists of two levels of structure.

C-structures   The constituent structure (c-structure) is a representation of the hierarchical grouping of words into phrases. It is used to represent constraints on word

order and constituency; the concept of c-structure corresponds to the notion of context-

free-grammar parse-tree used in formal language theory.

F-structures   The level of functional structure (f-structure) describes the grammatical functions of constituents in sentences, such as subject, direct object, sentential

complement or adjunct. F-structures are more abstract and less variable between lan-


Attribute   Meaning
subj        subject
obj         direct object
obj2        indirect object (also objθ)
obl         oblique or prepositional object
comp        sentential complement
xcomp       non-finite clausal complement
adjunct     adjunct

Table 2.1: LFG Grammatical functions

guages than c-structures. They can be thought of as providing a syntactic level close

to the semantics or the predicate-argument structure of the sentence. F-structures are

represented in LFG by attribute-value matrices. The attributes are atomic symbols;

their values can be atomic symbols, semantic forms, f-structures, or sets of f-structures, depending on the attribute. Formally, f-structures are

finite functions whose domain is the set of attributes and the codomain is the set of

possible values. Table 2.1 lists the grammatical functions most commonly assumed

within LFG.

Those two levels of syntactic structure are related through the so-called projection

architecture. Nodes in the c-structure are mapped to f-structures via the many-to-one

projection function φ.

Functional equations

An LFG grammar consists of a set of phrase structure rules

and a set of lexical entries, which specify the possible c-structures. Both the phrase

structure rules and the lexical entries are annotated with functional equations, which

specify the mapping φ. The functional equations employ two meta-variables, ↓ and ↑

which refer to the f-structure associated with the current (self) node and the f-structure

associated with its mother node, respectively. The = symbol in the functional equations

is the standard unification operator.

(2.1)   S   −→   NP             VP
                 (↑ subj) = ↓   ↑ = ↓


The phrase structure rule in (2.1) is interpreted as follows: the node S has a left

daughter NP and a right daughter VP, the f-structure associated with S unifies with

the f-structure for VP, while the value of the subj attribute of the f-structure for S

unifies with the f-structure associated with the NP.

The notation (f subj) denotes the f-structure f applied to the attribute subj, i.e.

the value of that attribute in f. Function application is left-associative so (f xcomp

subj) is the same as ((f xcomp) subj) and denotes the value of the subj attribute in

the f-structure (f xcomp).
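As an informal illustration (my own encoding, not part of the LFG formalism), f-structures can be modeled as finite functions from attributes to values using nested dictionaries, with function application to attribute paths being left-associative:

```python
# An f-structure as nested attribute-value pairs.  Aliasing the same dict
# from two places models the line connecting identical sub-f-structures.
stocks = {"pred": "stock", "num": "pl"}
f = {"pred": "keep<xcomp>subj",
     "subj": stocks,
     "xcomp": {"pred": "fall<subj>", "subj": stocks}}

def apply_path(fs, *attrs):
    """Left-associative application: apply_path(f, 'xcomp', 'subj')
    computes ((f xcomp) subj), written (f xcomp subj) in LFG notation."""
    for attr in attrs:
        fs = fs[attr]
    return fs

assert apply_path(f, "xcomp", "subj")["pred"] == "stock"
# (f xcomp subj) and (f subj) denote one and the same f-structure here:
assert apply_path(f, "xcomp", "subj") is apply_path(f, "subj")
```

The shared `stocks` object mirrors the raising analysis of keep discussed below, where the matrix subject and the xcomp's subject are identical.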

Figure 2.1 shows the c-structure and the f-structure for the English sentence But

stocks kept falling. The nodes in the c-structure are associated with functional equa-

tions. The equations on the phrasal nodes come from the phrase-structure rules; the

ones on the terminals come from lexical entries. The accompanying f-structure is the

minimal f-structure satisfying the set of constraints imposed by this set of equations.

Two of the sub-f-structures are connected with a line; this notation is a shorthand

signifying that the f-structures are identical.

Semantic forms   The values of the pred attribute are so-called semantic forms: however, rather than representing semantics they correspond to subcategorization frames for lexical items. They encode the number and the grammatical function of the syntactic arguments the lexical item requires. For example ‘fall⟨subj⟩’¹ means that fall needs one argument, with the grammatical function subj. Semantic forms are uniquely instantiated, i.e. they should be understood as having an implicit index: only semantic forms with an identical index are considered equal. This ensures that semantic forms corresponding to two distinct occurrences of a lexical item in a sentence cannot be unified. For example in the f-structure for the sentence:

(2.2) The big fish devoured the little fish.

the two semantic forms ‘fish’₁ and ‘fish’₂ are distinct and cannot be unified.

The line connecting two f-structures to signify that they are identical also implies that the implicit indices in the semantic forms are identical.

¹ In ‘keep⟨xcomp⟩subj’ the subj function is outside the angle brackets: this notation indicates that keep does not impose semantic selectional restrictions on the “raised” subject, which it shares with its xcomp argument.


[Figure: c-structure tree for But stocks kept falling, with each node annotated with functional equations, alongside the corresponding f-structure]

Figure 2.1: LFG representation of But stocks kept falling


Well-formedness of f-structures   F-structures have three general well-formedness

conditions imposed on them (following Bresnan and Kaplan (1982)).

Completeness An f-structure is locally complete iff it contains all the governable

grammatical functions that its predicate subcategorizes for. An f-structure is

complete iff all its sub f-structures are locally complete. Governable grammatical

functions correspond to possible types of syntactic arguments and include subj,

obj, obj2, xcomp, comp, obl.

Coherence An f-structure is locally coherent iff all its governable grammatical func-

tions are subcategorized for by its local predicate. An f-structure is coherent iff

all its sub f-structures are locally coherent.

Consistency In a given f-structure an attribute can have only one value.²

Together these constraints ensure that all the subcategorization requirements are

satisfied and that no non-governed grammatical functions occur in an f-structure.

Long-distance dependencies and functional uncertainty   Some phenomena in

natural languages such as topicalization, relative clauses and wh-questions introduce

long distance dependencies. Those are constructions where a constituent can be arbi-

trarily distant from its governing predicate.

(2.3) What₁ did she never suspect she would have to deal with __₁?

In an LFG analysis of (2.3) the interrogative pronoun what has the grammatical

function focus in the top-level f-structure and at the same time the function obj in

the embedded f-structure corresponding to the prepositional phrase introduced by with

at the end of the sentence. In principle an unbounded number of tensed clauses can

separate the interrogative pronoun from its governing predicate.

In order to express such constraints involving unbounded embeddings, LFG resorts

to functional equations with paths through the f-structures written as regular expres-

sions. Such equations are referred to as functional uncertainty equations. For example

to express the constraint that the value of the focus attribute is equal to the value of

² This constraint follows automatically if we regard f-structures as functions.


the obj attribute arbitrarily embedded in a number of comps or xcomps one would

write

(f focus) = (f {comp | xcomp}∗ obj).

The vertical bar operator | indicates the disjunction of two expressions, while the Kleene star operator ∗ has its standard meaning: zero or more occurrences of the preceding expression.
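A functional uncertainty equation can be read as a regular-expression constraint over attribute paths. The following toy sketch (my own illustration, not an LFG implementation) enumerates the values reachable along paths matching such an expression:

```python
import re

def uncertainty_values(fs, pattern, path=()):
    """Yield values in a nested-dict f-structure whose attribute path
    matches a regular expression over attribute names, one space per step,
    e.g. '((comp|xcomp) )*obj' for {comp | xcomp}* obj."""
    if path and re.fullmatch(pattern, " ".join(path)):
        yield fs
    if isinstance(fs, dict):
        for attr, val in fs.items():
            yield from uncertainty_values(val, pattern, path + (attr,))

# The displaced obj may sit under any number of comp/xcomp embeddings:
f = {"focus": "what",
     "comp": {"xcomp": {"obj": "what"}}}

vals = list(uncertainty_values(f, r"((comp|xcomp) )*obj"))
assert vals == ["what"]
assert f["focus"] == vals[0]  # the focus value equals the embedded obj
```

A real LFG parser solves such equations over packed analyses rather than by exhaustive enumeration, but the path-matching intuition is the same.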

2.2 LFG parsing

In this section I briefly review common approaches to parsing natural language with

LFG grammars and then describe in some detail the wide-coverage treebank-based LFG

acquisition methodology developed at DCU. This will serve as background to my own

work on integrating machine learning techniques within this approach.

Computational implementations of LFG and related formalisms such as Head-

Driven Phrase-Structure Grammar (HPSG) are sometimes described as deep grammars.

This term highlights the fact that computational work within these frameworks aims at

parsing natural language text into information-rich, linguistically plausible representa-

tions which account for complex phenomena such as control/raising and long-distance

dependencies. They provide a level of syntax abstract and rich enough for interfacing

with semantics. Until relatively recently, data-driven methods for processing language,

such as parsers based on Probabilistic Context Free Grammars (PCFG), did not pro-

vide such rich structures but rather more “shallow”, “surfacy” representations such as

basic constituency trees.

The level of f-structures in LFG is intermediate between a basic constituency tree

and a semantic representation. The higher level of abstraction as compared to c-

structures can be useful for applications such as e.g. Question Answering, where we

would like to have access to some approximation of argument structure. Since the f-

structures abstract over surface word order they are more appropriate for this purpose:

e.g. two English sentences differing only in adverb placement will receive the same

f-structure representation even though their c-structures differ. This benefit is even

more pronounced in languages with flexible constituent order, where e.g. core verb


arguments can appear pre- or postverbally. Additionally, at f-structure level, many

dependencies between predicates and their displaced arguments, such as in questions,

relative clauses or topicalization, are resolved, which further eases the task of matching

similar meanings expressed by means of alternative constructions.

Initial work on parsing with deep grammars was based on hand-writing the gram-

mars and using a parsing engine specialized to the grammatical formalism in question

to process sentences. In the context of LFG, the Pargram project (Butt et al., 2002)

has been developing wide-coverage hand-written grammars for a number of languages,

using the XLE parser and grammar development platform (Maxwell and Kaplan, 1996).

Such grammars have been subsequently coupled with stochastic disambiguation models

trained on annotated treebank data which choose the most likely analysis from among

the ones proposed by the parser (Riezler et al., 2001; Kaplan et al., 2004).

2.2.1 Treebank-based LFG parsing

Hand-written LFG grammars such as those developed for the Pargram project can

offer relatively wide coverage. However, their development takes a large amount of

time dedicated by expert linguists, and the coverage still falls short in comparison to

that of shallower, probabilistic parsers which use treebank grammars.

This bottleneck caused by manual grammar writing has motivated an alternative

approach to deep parsing, inspired by probabilistic treebank-based parsers. The idea is

to exploit a treebank and automatically convert it to a deep-grammar representation.

Most research in this framework has used the English Penn II treebank (Marcus et al.,

1994). In addition to constituency trees this treebank employs a number of extra

devices to provide information necessary for the recovery of predicate-argument-adjunct

relations. The most important ones are traces coindexed with phrase structure nodes,

and function labels indicating grammatical functions and semantic roles for adjuncts.

Early work on converting the Penn treebank to a deep-grammar representation and

using this resource to build a data-driven deep parser was carried out within the Tree

Adjoining Grammar (TAG) formalism (Xia, 1999). Subsequently, similar resources

were developed for other grammar formalisms: LFG (Cahill et al., 2002, 2004), HPSG

(Miyao et al., 2003; Miyao and Tsujii, 2005) and Combinatory Categorial Grammar


(CCG) (Clark and Hockenmaier, 2002; Clark and Curran, 2004).

DCU LFG parsing architecture

The treebank-based parsing research within the HPSG and CCG frameworks follows

a similar pattern: the original treebank trees are semi-automatically corrected and

modified to make them more compatible with the target linguistic representations.

Then a conversion algorithm is applied to the treebank trees, and produces as a result

a collection of HPSG signs or CCG derivations. This transformed treebank is then

used to extract a grammar and train a stochastic disambiguation model which works

on packed chart representations (feature forests (Miyao and Tsujii, 2002, 2008)) and

chooses the most likely parse from among the ones proposed by a dedicated HPSG or

CCG parser.

The projection architecture of LFG with the two levels of syntactic representation

linked via functional annotations on phrase structure rules facilitates an alternative,

more modular implementation strategy. The parsing process is divided into two steps:

c-structure parsing and f-structure construction.

Treebank annotation   A key component in the DCU LFG parsing architecture is

the LFG annotation algorithm. It is a procedure which walks the c-structure trees

and annotates each node with functional equations. The result is an annotated c-

structure tree such the one depicted in Figure 2.1. Of course the structure of the tree

underdetermines the set of constraints that defines the corresponding f-structure, so the

annotation algorithm uses additional sources of information to produce the equations:

Head table. This table specifies, for each local subtree of depth one, which constituent

is the head daughter. Similar tables are used in treebank-based lexicalized proba-

bilistic parsers, and the annotation algorithm for the English Penn treebank uses

an adapted version of the head table from Magerman (1994).

Function labels. Function labels in the English Penn treebank annotate some nodes

with their grammatical function, and label some adjuncts with semantic roles.

Grammatical function labels are very useful since they can be mapped straight-

forwardly to LFG functional equations.


Coindexed traces. Traces in the English Penn treebank provide information neces-

sary to recover predicate-argument structure, identify control/raising construc-

tions and resolve long-distance dependencies.

Integrated and pipeline models   There are two alternative approaches to LFG

parsing within the general DCU architecture. The integrated model works as follows.

The original treebank trees are annotated with functional equations. This collection

of annotated trees is used to train a PCFG parser or a lexicalized probabilistic parser

such as (Collins, 1999; Charniak, 2000; Charniak and Johnson, 2005). The functional-

equation-annotated nodes are treated as atomic phrase labels and thus the parser learns

to output trees with such labels. To process new text, the annotated-treebank-trained

model is used to produce a tree. Then the function equations encoded on the labels

are collected and evaluated using a dedicated LFG constraint solver, which produces

the f-structure they define.

The pipeline model takes a more modular approach. The c-structure parsing

model (again using some off-the-shelf data driven parsing engine) is trained on original

treebank trees. When processing a new sentence, it is first parsed into a basic c-

structure tree. The annotation algorithm is run on this tree, and the resulting equations

are again evaluated to obtain an f-structure. The bare c-structure tree does not contain

function labels or traces – the annotation algorithm will still work without those but

may be less accurate. For this reason there is a module which adds function labels to

the c-structure tree.

For both the integrated and pipeline models there is a non-local dependency (NLD)

resolution module which deals with non-local phenomena such as raising/control con-

tructions and long distance dependencies. Figure 2.2 illustrates the complete LFG

parsing architecture in the pipeline version. In the work described in the rest of this

thesis I always assume the pipeline architecture: its modular design makes it easy to

improve specific components in a piecewise fashion, independently of each other. By

breaking up the task it also reduces model size and permits more fine-grained control

over the features used for each component.


Figure 2.2: Pipeline LFG parsing architecture

Morphological analyzer   The first module in the pipeline is the morphological analyzer. For English, which has a reduced amount of inflectional morphology, it is a

simple dictionary which associates word forms with their POS tags and lemmas. POS-

tagging is either integrated in the c-structure parser, or an external POS-tagger may

be used. The lemmas are looked up in the dictionary while running the annotation

algorithm since they are needed to construct the semantic forms. For morphologically

richer languages a more sophisticated morphology module can be beneficial: Chapter

6 describes the development of such a module for use with the LFG parsing pipeline.

C-structure parsing   A c-structure parser can be any data-driven statistical parser

which can be trained on a treebank. This approach allows us to leverage advances

in parsing by using state-of-the-art components such as the parser of Charniak and

Johnson (2005). On the other hand, the use of a c-structure parser within a pipeline

means that decisions are taken early, and if this component chooses a wrong tree, this

mistake will not be undone in later processing stages.

Function labeler   C-structures labeled with function labels allow the annotation al-

gorithm to produce more accurate functional equations. For some languages and tree-

banks they are even more important than for English – if c-structures are flat, most

syntactic information resides in the grammatical function labels. Rich function labeling

also reduces the amount of work that needs to be done within the f-structure annota-

tion algorithm, since it can simply exploit the straightforward mapping from function

labels to LFG annotations. For those reasons it is highly desirable to have an accurate

data-driven function label model. Chapter 5 discusses research on developing such a

model for the Spanish treebank and using it for Spanish LFG parsing. Additionally it

introduces a high-performing functional labeler for English, which is also trained and

evaluated on Chinese treebank data.


NLD module   Prior to the application of the NLD module the LFG parser outputs

so-called proto f-structures. The grammatical functions used to analyze wh-questions,

relatives and topicalization (topic and focus) are not resolved, i.e. they are not identified with f-structures at the level where they fulfill the subcategorization requirements of

their governing predicate. The NLD module resolves those dependencies; the module

is described in detail in Cahill et al. (2004). In brief, the possible resolution candi-

dates are generated, and they are ranked according to the product of two scores: the

probability of the subcategorization frame given the lemma, and the probability of the

path through f-structure from the source grammatical function to the target (or a fi-

nite approximation of a functional uncertainty equation), given the source grammatical

function. Both conditional probabilities are obtained from f-structures generated for

treebank trees, via Maximum Likelihood estimates.
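The ranking scheme can be illustrated with a toy sketch using invented counts; the actual module of Cahill et al. (2004) is considerably more elaborate, but candidates are scored by the same product of two Maximum Likelihood estimates:

```python
from collections import Counter

# Toy counts; in the real module these come from f-structures generated
# for treebank trees.  All numbers and frames here are invented.
frame_counts = Counter({("keep", "<xcomp>subj"): 40, ("keep", "<subj,obj>"): 10})
path_counts = Counter({("focus", ("obj",)): 60,
                       ("focus", ("comp", "obj")): 30,
                       ("focus", ("comp", "comp", "obj")): 10})

def p_frame(lemma, frame):
    """MLE of P(subcategorization frame | lemma)."""
    total = sum(c for (l, _), c in frame_counts.items() if l == lemma)
    return frame_counts[(lemma, frame)] / total

def p_path(source_gf, path):
    """MLE of P(path through the f-structure | source grammatical function)."""
    total = sum(c for (s, _), c in path_counts.items() if s == source_gf)
    return path_counts[(source_gf, path)] / total

def score(lemma, frame, source_gf, path):
    """Resolution candidates are ranked by the product of the two scores."""
    return p_frame(lemma, frame) * p_path(source_gf, path)

# Under these counts, the shorter, more frequent path outranks the longer one:
assert score("keep", "<xcomp>subj", "focus", ("obj",)) > \
       score("keep", "<xcomp>subj", "focus", ("comp", "obj"))
```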

2.3 GramLab – Treebank-Based Acquisition of Wide-Coverage

LFG Resources

The approach to LFG grammar acquisition and parsing outlined above has been applied

mainly to the English Penn treebank. The aim of the GramLab project is to attempt

to port this architecture to other languages and treebanks. A successful adaptation of

the method would provide valuable NLP resources for those languages and would also

potentially enable multilingual applications such as cross-language information retrieval

or data-driven transfer-based machine translation.

The challenge is to investigate to what degree the methodology developed for En-

glish will work in the context of languages with possibly quite different characteristics.

The payoff is that we learn how to make language processing more robust in the face

of the variety characterizing human languages. Research within GramLab has investigated treebank-based LFG parsing for Japanese (Oya and van Genabith, 2007), Chinese

(Burke et al., 2004b; Guo et al., 2007), German (Cahill et al., 2005), French (Schluter

and van Genabith, 2007), Arabic (Al-Raheb et al., 2006) and Spanish (O’Donovan

et al., 2005; Chrupała and van Genabith, 2006a,b).

The research described in this thesis has been carried out within the GramLab


project. I have investigated the issues arising when adapting the DCU approach to

LFG parsing to the Spanish Cast3LB treebank. The insights learned from this work

have informed my research on applying machine-learning methods to develop robust

language-independent morphology and function labeling modules and thus minimiz-

ing the effort that has to be devoted to developing the highly language-specific LFG

annotation algorithms.


Chapter 3

Machine Learning

3.1 Introduction

In this chapter I provide a brief introduction to the field of supervised machine-learning

and give an overview of several machine-learning algorithms useful in Natural Language

Processing. In Section 3.2 I present four algorithms used for classification: Perceptron,

k-nearest-neighbors, MaxEnt and Support Vector Machine. In Section 3.3 I briefly

discuss approaches to the sequence labeling task.

3.1.1 Supervised learning

In supervised learning the goal is to learn a function

h : X → Y    (3.1)

where x ∈ X are inputs and y ∈ Y are outputs. The input objects are called instances,

or examples, and they can be any kind of object, depending on the particular learning

task: in NLP they could be for example documents to classify, strings of words to tag

with POS-sequences or sentences in the source language to translate into the target

language. Depending on the nature of the output space Y, learning tasks can be

categorized into several types:

• Binary classification: Y = {−1,+1}

• Multiclass classification: Y = {1,...,K} (finite set of labels)


• Regression: Y = R

• Structured prediction: here the outputs in Y are complex. For example, in a sequence labeling task such as POS-tagging, Y = {1,...,K}^n, i.e. the output is a sequence of labels of length n equal to the length of the input string.

3.1.2 Feature representation

The prediction is based on the feature function Φ : X → F. The function Φ takes an

input object and extracts features which are useful in predicting the output. The feature space F most common in machine learning is F = R^D, i.e. D-dimensional

real vector space. Specifically in NLP, the features will typically either be binary or

will be symbols rather than numbers; the details depend on the learning algorithm.

Feature binarization   For algorithms which require binary features, we can extract symbolic features from

instances and then binarize the output vectors of symbols. A common way of binarizing

features involves mapping each feature-value pair to a new feature and assigning it 1

if it is active and 0 otherwise. Thus for each original symbolic feature i we create as

many new binary features as the number of possible values for feature i; one of them

is set to 1, while all the others are 0. This gives rise to sparse binary vectors with few

non-zero elements. For a feature vector x of length d the corresponding binary vector

x?is given by:

x?= ([[x1= V11]],[[x1= V12]],...,[[x1= V1n]],...,...,[[xd= Vd1]],...,[[xd= Vdm]]) (3.2)

where the jthelement of the set of possible values for the ithfeature is Vijand where

[[p]] =

1 if p is true

0 otherwise .
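The binarization of equation (3.2) can be sketched as follows; this is an illustrative implementation, with the dimensions derived from the values actually observed in the data, and the feature values invented for the example:

```python
def binarize(instances):
    """One binary dimension per (feature index, value) pair seen in the data.

    Returns the list of dimensions and the sparse 0/1 vectors: for each
    original symbolic feature, exactly one of its dimensions is active.
    """
    dims = sorted({(i, v) for x in instances for i, v in enumerate(x)})
    def encode(x):
        active = set(enumerate(x))
        return [1 if d in active else 0 for d in dims]
    return dims, [encode(x) for x in instances]

# Two instances with two symbolic features each (e.g. a POS tag and a suffix):
dims, vecs = binarize([("DT", "ing"), ("NN", "ed")])
assert dims == [(0, "DT"), (0, "NN"), (1, "ed"), (1, "ing")]
assert vecs == [[1, 0, 0, 1], [0, 1, 1, 0]]
assert all(sum(v) == 2 for v in vecs)  # one active value per original feature
```

The resulting vectors are sparse, which is why practical implementations store only the indices of the non-zero dimensions.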

In the rest of this chapter I will concentrate on two learning tasks: classification,

that is learning to assign a label from a (small) finite set to examples, and sequence


labeling, that is assigning sequences of such labels to examples which are typically

strings of words.

3.2 Classification

3.2.1 Perceptron

One of the simplest classification algorithms is the perceptron (Rosenblatt, 1958). Like

more complex algorithms presented later in this chapter (MaxEnt and SVMs), it is a

linear classifier; i.e. in the case of binary classification, it learns the hyperplane sepa-

rating the positive and the negative examples in a multidimensional feature space. I

describe it here as a basic, easy-to-understand instance of a linear hyperplane-based

classification algorithm.

The separating hyperplane is defined by a weight vector w of size d and the bias b: the weights w_1,...,w_d and the bias b correspond to the hyperplane equation w_1 x_1 + w_2 x_2 + ... + w_d x_d + b = 0.

The decision function assigning the example to either the positive or negative class

has the following form:

f(x, w, b) = sign(w · Φ(x) + b) = sign( Σ_{i=1}^{d} w_i Φ(x)_i + b )    (3.3)

That is, if the dot product of the weight vector and the feature vector of example x

(plus bias b) is > 0 the example is classified as positive, if it is < 0 it is classified as

negative.

The learning problem thus consists in learning the parameters (the weights and

the bias) from the set of training examples. The perceptron is an online learning

algorithm, i.e. it processes one training example at a time. Initially the parameters are

set to zero. If the current example is correctly classified by the current parameters then

the algorithm proceeds to the next step. If the example is misclassified, the parameters

are updated so that it is correctly classified. The algorithm iterates over the training

examples until no further updates are necessary. The algorithm eventually converges to

parameter settings that correctly classify the whole training set (if the data is linearly


Perceptron(x1:N, y1:N, I):
 1: w ← 0; b ← 0
 2: wa ← 0; ba ← 0
 3: c ← 1
 4: for i = 1 ... I do
 5:   for n = 1 ... N do
 6:     if yn(w · Φ(xn) + b) ≤ 0 then
 7:       w ← w + ynΦ(xn); b ← b + yn
 8:       wa ← wa + c ynΦ(xn); ba ← ba + c yn
 9:     c ← c + 1
10: return (w − wa/c, b − ba/c)

Figure 3.1: Averaged Perceptron algorithm

Figure 3.2: Example separating hyperplanes in two dimensions

separable).

A modification of the basic algorithm, the averaged perceptron, is able to achieve better generalization to unseen examples. The final parameters which the algorithm returns are the average of all the hypothesized parameters encountered during the algorithm run. An efficient implementation of the averaged perceptron algorithm is shown in Figure 3.1 (Daumé III, 2006). It works similarly to the basic version, but in

addition to current parameters (w,b), the averaged parameters (wa,ba) are maintained

(see lines 2 and 8). When an example is incorrectly classified, in line 8 those parameters

are updated, but the update is multiplied by the averaging count c. Finally in line 10,

the algorithm returns (w−wa/c,b−ba/c), which corresponds to the average of all the

values the parameters (w,b) took.
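As a minimal sketch, the averaged perceptron of Figure 3.1 can be transcribed into Python as follows (the toy data set is invented for illustration; labels are assumed to be in {−1, +1} and feature vectors dense):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def averaged_perceptron(X, Y, iterations):
    """Efficient averaged perceptron, following Figure 3.1."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0      # current parameters
    wa, ba = [0.0] * d, 0.0    # averaging accumulators
    c = 1                      # averaging count, incremented per example
    for _ in range(iterations):
        for x, y in zip(X, Y):
            if y * (dot(w, x) + b) <= 0:    # misclassified: update
                for i in range(d):
                    w[i] += y * x[i]
                    wa[i] += c * y * x[i]
                b += y
                ba += c * y
            c += 1
    # Return the averaged parameters (w - wa/c, b - ba/c).
    return [wi - wai / c for wi, wai in zip(w, wa)], b - ba / c

# Linearly separable toy data: positive iff the second feature exceeds the first.
X = [[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]]
Y = [1, 1, -1, -1]
w, b = averaged_perceptron(X, Y, 10)
```

After training, `sign(dot(w, x) + b)` classifies all four toy examples correctly.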

If the data is linearly separable, there are obviously an infinite number of hyper-

planes which will separate the training examples. The solution found by the perceptron

algorithm depends on the order in which the examples are processed. Figure 3.2 shows

three solutions to the problem of separating the positive examples (blank points) from

the negative examples (filled points) found by the averaged perceptron algorithm de-

scribed above. Intuitively, some lines classify better than others: for example the dashed

blue line seems to be a better solution than the solid red line. This intuition can be


conceptualized as the notion of a margin: we want to find solutions which maximize

the distance between the separating hyperplane and the training examples. It has been

shown that a linear classifier’s generalization error to unseen test data is proportional to

the inverse of the margin (Vapnik, 2006; Freund and Schapire, 1998). The perceptron

algorithm does not maximize margin; an online learning algorithm based on the per-

ceptron idea which does is the Margin Infused Relaxed Algorithm (MIRA) (Crammer

and Singer, 2003). The most common maximum-margin algorithm is the Support Vector Machine, discussed in Section 3.2.4.

The version of perceptron presented here can only deal with linear classification.

This limitation can be lifted by using the algorithm in conjunction with the “kernel

trick” (Aizerman et al., 1964); kernels are discussed in Section 3.2.4 in connection with

Support Vector Machines.

3.2.2 K-NN

Another simple classification learning method is the k-nearest-neighbors algorithm (Fix

and Hodges, 1951; Cover and Hart, 1967). The idea is to assign to a new example the

class label associated with the majority of instances in its neighborhood. The neigh-

borhood is determined by the distance in the multidimensional feature space induced

by the feature vectors representing the instances. The parameter k specifies how many

nearest instances form the neighborhood.

In the case of real-valued features Euclidean distance is used, i.e. the distance $\Delta(x, x')$ between instances $x$ and $x'$ is:

$$\Delta(x, x') = \sqrt{\sum_{i=1}^{d} \left(\Phi(x)_i - \Phi(x')_i\right)^2} \quad (3.4)$$

In NLP k-NN is frequently used with symbolic features, which may encode word

forms, characters, morphological features and other non-numeric attributes. In this

case the most basic distance metric is the Hamming distance, also called overlap metric

or L1 metric. It defines the distance between two instances to be the sum of per-feature

distances; for symbolic features the per-feature distance is 0 for an exact match and 1


for a mismatch.

$$\Delta(x, x') = \sum_{i} \delta(\Phi(x)_i, \Phi(x')_i) \quad (3.5)$$

where

$$\delta(\Phi(x)_i, \Phi(x')_i) = \begin{cases} 0 & \text{if } \Phi(x)_i = \Phi(x')_i \\ 1 & \text{if } \Phi(x)_i \neq \Phi(x')_i \end{cases} \quad (3.6)$$

For a vector with a mixture of symbolic and numeric values, the above definition of per-feature distance is used for symbolic features, while for numeric ones we use the scaled absolute difference (Daelemans and van den Bosch, 2005):

$$\delta(\Phi(x)_i, \Phi(x')_i) = \frac{|\Phi(x)_i - \Phi(x')_i|}{\max_i - \min_i} \quad (3.7)$$

The k-NN algorithm modified to use this distance metric is referred to as IB1 (Aha

et al., 1991). Daelemans and van den Bosch (2005) interpret the k parameter differently

from the traditional meaning: instead of k nearest neighbors they consider neighbors at

k nearest distances. This makes a difference in the case where more than one instance

has the same distance to the test instance.
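The overlap-metric variant of k-NN can be sketched in Python as follows (the tiny "training set" of POS-trigram instances is invented for illustration):

```python
from collections import Counter

def overlap_distance(x, y):
    """Hamming/overlap distance (Equations 3.5-3.6): the per-feature
    distance is 0 on an exact match and 1 on a mismatch."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(train, k, x):
    """Assign x the majority label among its k nearest training instances.
    train is a list of (symbolic feature vector, label) pairs."""
    neighbours = sorted(train, key=lambda inst: overlap_distance(inst[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [
    (["det", "noun", "verb"], "NP"),
    (["det", "noun", "prep"], "NP"),
    (["verb", "det", "noun"], "VP"),
]
# The test instance shares two feature values with each NP instance
# but only one with the VP instance, so the majority vote is NP.
label = knn_classify(train, 3, ["det", "noun", "noun"])
```

Note this simple sketch takes the k nearest *instances*; the IB1 interpretation described above would instead gather all instances at the k nearest *distances*.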

Feature weighting

It is very common to use the IB1 algorithm with some feature weighting method,

where the per-feature distance is multiplied by the weight of the feature for which it is

computed. That is:

$$\Delta(x, x') = \sum_{i} w_i \, \delta(\Phi(x)_i, \Phi(x')_i) \quad (3.8)$$

where $w_i$ is the weight of the $i$th feature. There are many ways to find a good weight

vector w. Daelemans and van den Bosch (2005) describe two entropy-based methods

and a χ2-based method.

Information gain

Information gain is a measure of how much knowing the value of

a certain feature for an example decreases our uncertainty about its class, i.e. it is the

difference in class entropy with and without information about the feature value.


$$w_i = H(Y) - \sum_{v \in V_i} P(v) \times H(Y|v) \quad (3.9)$$

where $w_i$ is the weight of the $i$th feature, $Y$ is the set of class labels, $V_i$ is the set of possible values for the $i$th feature, $P(v)$ is the probability of value $v$, and class entropy is $H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y)$, while $H(Y|v)$ is the conditional class entropy given that the feature value is $v$.¹

Gain ratio

Information gain tends to assign excessive weight to features with a large

number of values. For example if each instance in the union of the training set and test

set has a unique value for a certain feature, then knowing the value of this feature gives

us certainty as to the class label for the instances in the training set. However it is

useless for predicting the class of a test instance as there are no training instances with

the same value for this feature. To remedy this bias, information gain can be normalized

by the entropy of the feature values, which gives the gain ratio:

$$w_i = \frac{H(Y) - \sum_{v \in V_i} P(v) \times H(Y|v)}{H(V_i)} \quad (3.10)$$

For a feature with a unique value for each instance in the training set, the entropy of

the feature values in the denominator will be maximally high, and will thus give a low

weight for this feature.
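Both weighting schemes can be sketched directly from Equations 3.9 and 3.10 (the four-instance example below is invented for illustration):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain_and_ratio(values, labels):
    """Information gain (Eq. 3.9) and gain ratio (Eq. 3.10) of one feature,
    given its value on each training instance and the class labels."""
    n = len(labels)
    remainder = 0.0
    for v, count in Counter(values).items():
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += (count / n) * entropy(subset)
    gain = entropy(labels) - remainder
    h_values = entropy(values)
    ratio = gain / h_values if h_values else 0.0
    return gain, ratio

# A feature whose value fully determines the class: the remainder is 0,
# so the gain equals H(Y) = 1 bit, and the gain ratio is 1.
values = ["a", "a", "b", "b"]
labels = ["pos", "pos", "neg", "neg"]
gain, ratio = info_gain_and_ratio(values, labels)
```

For a feature with a unique value per instance, `entropy(values)` in the denominator is maximal, so the gain ratio is pushed down exactly as described above.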

Chi-squared

Daelemans and van den Bosch (2005) adapt the $\chi^2$-based attribute selection method proposed by White and Liu (1994) as an alternative to information-theoretic methods. The following equation defines the $\chi^2$ statistic for a problem with $k$ classes and $m$ values for feature $F$:

$$\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad (3.11)$$

where $O_{ij}$ is the observed number of instances with the $i$th class label and the $j$th value of feature $F$. $E_{ij}$ is the expected number of such instances in case the null hypothesis, i.e. that the feature $F$ does not predict the class, is true. The expected value is defined

¹Numeric values need to be temporarily discretized for this to work.


as:

$$E_{ij} = \frac{n_{\cdot j} \, n_{i \cdot}}{n_{\cdot\cdot}} \quad (3.12)$$

where $n_{ij}$ is the frequency count of instances with the $i$th class label and the $j$th value of feature $F$, and

$$n_{\cdot j} = \sum_{i=1}^{k} n_{ij} \quad (3.13)$$

$$n_{i \cdot} = \sum_{j=1}^{m} n_{ij} \quad (3.14)$$

$$n_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij} \quad (3.15)$$

i.e. the total number of instances. Daelemans and van den Bosch (2005) propose to

either use the $\chi^2$ values as feature weights in Equation 3.8, or alternatively to use the

shared variance measure:

$$SV_F = \frac{\chi^2_F}{N \times (\min(k, m) - 1)} \quad (3.16)$$

where k is the number of classes, m the number of values for feature F and N the

number of instances.
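Equations 3.11–3.16 amount to a standard chi-squared computation over a class-by-value contingency table; a sketch in Python (with an invented perfectly-correlated table as the example):

```python
def chi_squared(counts):
    """Chi-squared statistic for a k-by-m contingency table, where
    counts[i][j] is the number of instances with the i-th class label
    and the j-th value of feature F (Equations 3.11-3.15)."""
    k, m = len(counts), len(counts[0])
    row = [sum(counts[i]) for i in range(k)]                       # n_i.
    col = [sum(counts[i][j] for i in range(k)) for j in range(m)]  # n_.j
    total = sum(row)                                               # n_..
    chi2 = 0.0
    for i in range(k):
        for j in range(m):
            expected = row[i] * col[j] / total   # E_ij under the null
            chi2 += (counts[i][j] - expected) ** 2 / expected
    return chi2

def shared_variance(chi2, n, k, m):
    """Shared variance measure (Equation 3.16)."""
    return chi2 / (n * (min(k, m) - 1))

# A feature value perfectly correlated with the class: chi2 equals N,
# and the shared variance is maximal (1.0).
table = [[10, 0], [0, 10]]
chi2 = chi_squared(table)
sv = shared_variance(chi2, 20, 2, 2)
```

A table with no association (all cells equal) gives $\chi^2 = 0$, and hence a zero feature weight.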

Distance-weighted class voting

In the basic version of the k-NN algorithm all the instances in the neighborhood are

weighted equally for computing the majority class to be assigned to a new instance.

However, we may want to treat the votes from very close neighbors as more important

than votes from more distant ones. A variety of distance weighting schemes have been

proposed to implement this idea; see (Daelemans and van den Bosch, 2005) for details

and discussion.

The k-NN algorithm, unlike the perceptron, is not a linear classifier, i.e. it does not

depend on the assumption that the data is linearly separable. The basic k-NN and its

various modifications have been referred to as lazy learners or memory-based learners.

During learning little “work” is done by the algorithm: the training instances are simply

stored in memory in some efficient manner. It is during prediction that most of the


actual computation takes place: the test instance is compared to the training instances,

the neighborhood is calculated, and the majority label assigned. In k-NN no abstraction is performed: the model generalizes by directly comparing the test instance with labeled training examples. No information is discarded; all the “exceptional” and low-frequency items are still available for informing the prediction.

The k-NN algorithm is one of the Machine Learning methods used for function

labeling experiments for Spanish in Section 5.2.

3.2.3 Logistic Regression and MaxEnt

Maximum Entropy (or MaxEnt) models are linear probabilistic classifiers commonly

used in NLP. In multiclass classification they output a probability distribution over class labels. MaxEnt models correspond to logistic regression models, but are derived in

an alternative way. In this section I introduce linear and logistic regression and then

present the MaxEnt classifier as typically used in NLP.

Linear regression

In linear regression models the prediction function h introduced in Equation 3.1 is

instantiated as h : X → R, i.e. we try to build models which predict outcomes in the

set of real numbers based on example objects, or observations, in X. The prediction is

based on the features, or predictors, which are also typically real numbers. The feature

function Φ maps observations to vectors of predictors, i.e. Φ : X → Rd. The model is

defined by the equation:

$$y = w_0 + \sum_{i=1}^{d} w_i \Phi(x)_i \quad (3.17)$$

where $y$ is the outcome, $\Phi(x)_1, \ldots, \Phi(x)_d$ are the feature values, $w_1, \ldots, w_d$ are the feature weights, and $w_0$ is the intercept (or bias). We can eliminate $w_0$ by adding a special $\Phi(x)_0$ feature which is always set to 1, and reduce the above equation to the dot product

between the weight vector and the feature vector:

$$y = w \cdot \Phi(x) \quad (3.18)$$


Note the similarity to Equation 3.3 for the linear classifier: in the case of binary clas-

sification we use the sign of the dot product to assign the object to the class; for linear

regression the value of the dot product is the outcome we are predicting.

In order to learn a linear regression model we minimize the sum squared error over

the training set of M examples:

$$\mathrm{cost}(w) = \sum_{j=1}^{M} \left(w \cdot \Phi(x^{(j)}) - y^{(j)}_{\mathrm{obs}}\right)^2 \quad (3.19)$$

There is a closed-form formula for choosing the best weights w, given by:

$$w = (X^T X)^{-1} X^T y \quad (3.20)$$

where the matrix X contains training example features, and y is the vector of outcomes.
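As a minimal sketch of the normal equation for the two-predictor case (where the $2 \times 2$ inverse can be written out directly; the toy data below is invented and fits $y = 2 + 3x$ exactly):

```python
def fit_linear_regression(X, y):
    """Least-squares weights via the normal equation w = (X^T X)^-1 X^T y,
    specialized to two predictors so the 2x2 inverse is explicit."""
    # X^T X (a 2x2 matrix) and X^T y (a length-2 vector)
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X)
    u = sum(x[0] * yi for x, yi in zip(X, y))
    v = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    # Multiply the inverse of [[a, b], [b, d]] by [u, v].
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# The first column is the constant-1 intercept feature Phi(x)_0.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 5.0, 8.0, 11.0]
w = fit_linear_regression(X, y)   # recovers intercept 2 and slope 3
```

For general $d$, one would of course use a linear-algebra routine rather than an explicit inverse.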

Logistic regression

In logistic regression we use the linear model to perform classification, i.e. assign prob-

abilities to class labels. For binary classification we want to predict the probability

of the instance being in the positive class given the instance: p(y = true|x). But the

predictions of a linear regression model are real numbers y ∈ R, whereas probabilities

range between 0 and 1: p(y = true|x) ∈ [0,1]. To ensure that the response is in the

valid range we can instead predict the logit function of the probability:

$$\ln\left(\frac{p(y = \mathrm{true}|x)}{1 - p(y = \mathrm{true}|x)}\right) = w \cdot \Phi(x) \quad (3.21)$$

$$\frac{p(y = \mathrm{true}|x)}{1 - p(y = \mathrm{true}|x)} = e^{w \cdot \Phi(x)} \quad (3.22)$$

Solving for p(y = true|x) we obtain:

$$p(y = \mathrm{true}|x) = \frac{e^{w \cdot \Phi(x)}}{1 + e^{w \cdot \Phi(x)}} \quad (3.23)$$

$$= \frac{\exp\left(\sum_{i=0}^{d} w_i \Phi(x)_i\right)}{1 + \exp\left(\sum_{i=0}^{d} w_i \Phi(x)_i\right)} \quad (3.24)$$


In order to learn a logistic regression model we use conditional likelihood estimation.

We choose the weights which make the probability of the observed outcomes (y) as high as possible, given the observations (x). For a training set with N examples:

$$\hat{w} = \operatorname*{argmax}_{w} \prod_{i=1}^{N} p_w(y^{(i)}|x^{(i)}) \quad (3.25)$$

There is no closed-form solution to this equation. It is a problem in convex optimization; several special-purpose and generic algorithms are available to train these models, e.g.

• L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno method)

• gradient ascent

• conjugate gradient

• iterative scaling algorithms
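As an illustration of the simplest of these, gradient ascent on the conditional log-likelihood can be sketched as follows for the binary case (the toy data, fixed learning rate and epoch count are all illustrative choices):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_logistic(X, y, rate=0.1, epochs=1000):
    """Fit binary logistic regression by batch gradient ascent;
    y[i] is in {0, 1} and each X[i] includes a constant bias feature."""
    d = len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Gradient of the log-likelihood: sum_i (y_i - p(x_i)) * x_i
        grad = [0.0] * d
        for x, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(d):
                grad[j] += (yi - p) * x[j]
        w = [wj + rate * gj for wj, gj in zip(w, grad)]
    return w

# One real predictor plus a bias feature; instances with x >= 3 are positive.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = train_logistic(X, y)
p = sigmoid(sum(wj * xj for wj, xj in zip(w, [1.0, 4.0])))
# p is close to 1 for this clearly positive instance.
```

L-BFGS or conjugate gradient would reach the same optimum far faster; the gradient itself is the same.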

Maximum Entropy Models

Logistic regression with more than two classes is referred to as multinomial logistic

regression, and also known as Maximum Entropy (MaxEnt). The MaxEnt equation

generalizes Equation 3.24 above:

$$p(y|x) = \frac{\exp\left(\sum_{i=0}^{d} w_i \Phi(x,y)_i\right)}{\sum_{y' \in Y} \exp\left(\sum_{i=0}^{d} w_i \Phi(x,y')_i\right)} \quad (3.26)$$

The denominator is the normalization factor, usually called $Z$, used to turn the score into a proper probability distribution:

$$p(y|x) = \frac{1}{Z} \exp\left(\sum_{i=0}^{d} w_i \Phi(x,y)_i\right) \quad (3.27)$$

Indicator features

Note that in the above the feature function Φ is parameterized

for the class label $y$. In MaxEnt modeling, binary indicator features, which depend on the class label, are typically used. Thus $\Phi(x,y)_i \in \{0,1\}$. For example, in the case of part-of-speech tagging, if the object $x$ is the word $w_0$ in its surrounding context, and


class label y = VBG, then an example feature might be:

$$\Phi(x,y)_1 = \begin{cases} 1 & \text{if } \mathrm{suffix3}(w_0) = \textit{ing} \wedge y = \mathrm{VBG} \\ 0 & \text{otherwise} \end{cases}$$

The model weight for this feature will indicate how strong a predictor the suffix “ing”

is for the label VBG.
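Equation 3.26 with such indicator features can be sketched in Python as follows (the features, weights and label set are invented for illustration):

```python
from math import exp

def maxent_prob(weights, features, x, labels):
    """p(y|x) as in Equation 3.26: exponentiate the weighted sum of the
    indicator features active for (x, y), then normalize over all labels."""
    scores = {y: exp(sum(w for w, f in zip(weights, features) if f(x, y)))
              for y in labels}
    z = sum(scores.values())   # the normalization factor Z
    return {y: s / z for y, s in scores.items()}

# Hypothetical POS-tagging indicator features over a word token x.
features = [
    lambda x, y: x.endswith("ing") and y == "VBG",
    lambda x, y: x.endswith("ed") and y == "VBD",
]
weights = [2.0, 2.0]
labels = ["VBG", "VBD", "NN"]
p = maxent_prob(weights, features, "running", labels)
# Only the "ing" feature fires, and only for y = VBG, so p["VBG"]
# is the largest; VBD and NN receive equal (uniform) probability.
```

The positive weight on the first feature is exactly what makes the suffix “ing” a strong predictor of the label VBG.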

Maximum Entropy and Maximum Likelihood

The name Maximum Entropy

comes from the fact that solving the optimization problem for finding the multinomial

logistic regression model whose weights maximize the likelihood of the training data is

equivalent to finding the probability distribution p∗ with maximum entropy among the

set of distributions C which are consistent with the constraints imposed by the features

and the training data:

$$p^* = \operatorname*{argmax}_{p \in C} H(p) \quad (3.28)$$

where the entropy of the distribution of discrete random variable X is given by:

$$H(X) = -\sum_{x} P(X = x) \log_2 P(X = x) \quad (3.29)$$

This duality was demonstrated by Berger et al. (1996). Maximizing the entropy subject

to some constraints is motivated by the well-known Occam’s razor principle: our model

should be as simple as possible while still predicting the data; in this case “simple”

is interpreted as maximally uniform, since the uniform distribution has the highest

entropy. The constraints imposed on the probability model are encoded in the features:

the expected value of each one of $I$ indicator features $f_i$ under a model $p$ should be equal to the expected value under the empirical distribution $\tilde{p}$ obtained from the training data:

$$\forall i \in I, \quad E_p[f_i] = E_{\tilde{p}}[f_i] \quad (3.30)$$


where $f_i(x,y) = \Phi(x,y)_i$. The expected value under the empirical distribution is given by:

$$E_{\tilde{p}}[f_i] = \sum_{x} \sum_{y} \tilde{p}(x,y) f_i(x,y) = \frac{1}{N} \sum_{j=1}^{N} f_i(x_j, y_j) \quad (3.31)$$

The expected value according to model p is:

$$E_p[f_i] = \sum_{x} \sum_{y} p(x,y) f_i(x,y) \quad (3.32)$$

However, this requires summing over all possible object–class label pairs, which is in general not possible. Therefore the following standard approximation is used (Rosenfeld, 1996):

$$E_p[f_i] = \sum_{x} \sum_{y} \tilde{p}(x) \, p(y|x) f_i(x,y) = \frac{1}{N} \sum_{j=1}^{N} \sum_{y} p(y|x_j) f_i(x_j, y) \quad (3.33)$$

where $\tilde{p}(x)$ is the relative frequency of object $x$ in the training data; this has the advantage that $\tilde{p}(x)$ for unseen events is 0. The term $p(y|x)$ is calculated according to Equation 3.26.

Regularization

Although the Maximum Entropy principle used in MaxEnt modeling

ensures that the models are maximally uniform subject to the constraints, they can still

overfit the training data, resulting in poor generalization to unseen instances. There

is a technique called regularization which results in relaxing the requirement that the

constraints be satisfied exactly and results in models with smaller weights which may

perform better on new data. Instead of solving the optimization in Equation 3.25,

repeated here in log-space form:

$$\hat{w} = \operatorname*{argmax}_{w} \sum_{i=1}^{M} \log p_w(y^{(i)}|x^{(i)}), \quad (3.34)$$

we solve the following modified problem:

$$\hat{w} = \operatorname*{argmax}_{w} \sum_{i=1}^{M} \log p_w(y^{(i)}|x^{(i)}) + \alpha R(w) \quad (3.35)$$


where R is the regularizer used to penalize large weights (Jurafsky and Martin, 2008).

We can use a regularizer which assumes that weight values have a Gaussian distribution centered on 0 and with variance $\sigma^2$ (Chen and Rosenfeld, 1999). By multiplying each weight by a Gaussian prior we will maximize the following equation:

$$\hat{w} = \operatorname*{argmax}_{w} \sum_{i=1}^{M} \log p_w(y^{(i)}|x^{(i)}) - \sum_{j=0}^{d} \frac{w_j^2}{2\sigma_j^2} \quad (3.36)$$

where $\sigma_j^2$ are the variances of the Gaussians of the feature weights. This modification

corresponds to using a maximum a posteriori rather than maximum likelihood model

estimation. In practice it is common to constrain all the weights to have the same

global variance, which gives a single tunable algorithm parameter, whose optimal value

can be found on held-out data or by cross-validation.
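In gradient-based training the Gaussian prior of Equation 3.36 simply contributes an extra $-w_j/\sigma^2$ term to each component of the gradient, shrinking every weight towards zero. A minimal sketch of one such update step (the weights, step size and variance are illustrative values):

```python
def regularized_update(w, grad_ll, rate, sigma2):
    """One gradient-ascent step on the penalized log-likelihood of
    Equation 3.36: the Gaussian prior with variance sigma2 adds
    -w_j / sigma2 to the log-likelihood gradient for each weight."""
    return [wj + rate * (gj - wj / sigma2) for wj, gj in zip(w, grad_ll)]

# With a zero likelihood gradient, the prior term alone decays the weights:
w = [4.0, -2.0]
w = regularized_update(w, [0.0, 0.0], rate=0.5, sigma2=1.0)
# Each weight is halved (4.0 -> 2.0, -2.0 -> -1.0) by the shrinkage term.
```

A smaller variance $\sigma^2$ means a stronger pull towards zero, i.e. heavier regularization.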

The MaxEnt algorithm is one of the Machine Learning methods used for function

labeling experiments for Spanish in Section 5.2. I also employ Maximum Entropy

models for the Morfette morphological analysis system described in Sections 6.4 and

6.5.

3.2.4 Support Vector Machines

Support Vector Machine (SVM) is a machine learning algorithm which exploits two key ideas: large-margin classification and the “kernel trick”.

Large margin

The idea of large margin classification, mentioned briefly in Section 3.2.1, is both intu-

itively appealing and theoretically motivated. Intuitively, it makes sense for the decision

boundary to be as far away from the training instances as possible: this improves the

chance that if the position of the data points is slightly perturbed, the decision boundary

will still be correct. Results from Statistical Learning Theory confirm these intuitions:

maintaining large margins leads to small generalization error (Vapnik, 1995).

Formally, the functional margin of an instance (x,y) with respect to some hyper-

plane (w,b) is defined to be

$$\gamma = y(w \cdot \Phi(x) + b) \quad (3.37)$$


Figure 3.3: Separating hyperplane and support vectors

Some data points will have the minimum functional margin: the functional margin

of the whole data set with respect to the hyperplane is then twice that quantity.

However, the functional margin can be made larger just by rescaling the weights by

some constant: (λw,λb) without changing the associated hyperplane. Hence we can

fix the functional margin to be 1 and minimize the norm of the weight vector (which

is equivalent to maximizing the geometric margin).

This results in the following quadratic programming optimization formulation of

the SVM learner: For linearly separable training instances ((x1,y1),...,(xn,yn)) find

the hyperplane (w,b) that solves the optimization problem:

$$\begin{aligned} \operatorname*{minimize}_{w,b} \quad & \frac{1}{2}||w||^2 \\ \text{subject to} \quad & \forall i \in 1..n \quad y_i(w \cdot \Phi(x_i) + b) \geq 1 \end{aligned} \quad (3.38)$$

This hyperplane separates the examples with geometric margin $2/||w||$.

Since SVM finds a separating hyperplane with the largest margin to the nearest

instance, this has the effect of the decision boundary being fully determined by a small

subset of the training examples, namely the nearest ones on both sides. Those instances

are the support vectors which SVM is named after.

Figure 3.3 shows the same data points as Figure 3.2. The solid line is the separating

hyperplane with the maximum margin with respect to the training data; the points on

the dotted line are support vectors.

Soft margin

For datasets which are not linearly separable there will be no hyperplane

satisfying the constraints. To deal with such cases a version of SVM with a soft margin has been proposed. It works by relaxing the requirement that all data points lie outside the margin, and introduces a penalty term which measures how much this requirement is violated. For each offending instance there is a “slack variable” $\xi_i$ which measures

how much it would have to be moved to make it obey the margin constraint. This leads


to the following modified formulation:

$$\begin{aligned} \operatorname*{minimize}_{w,b} \quad & \frac{1}{2}||w||^2 + C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad & \forall i \in 1..n \quad y_i(w \cdot \Phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \end{aligned} \quad (3.39)$$

where

$$\xi_i = \max(0, 1 - y_i(w \cdot \Phi(x_i) + b))$$

The hyper-parameter C is the cost of margin constraint violation, used to trade off

minimizing the norm of the weight vector versus classifying correctly as many examples

as possible. As the value of C tends towards infinity the soft-margin SVM approximates

the hard-margin version.
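Because the slack variables have the closed form above, the soft-margin objective can equivalently be minimized as an unconstrained hinge-loss problem; a subgradient-descent sketch (not the QP solver a real SVM package would use; the toy data, learning rate and epoch count are illustrative):

```python
def train_svm(X, Y, C=1.0, rate=0.01, epochs=2000):
    """Soft-margin linear SVM by subgradient descent on the unconstrained
    form of Equation 3.39: 1/2 ||w||^2 + C sum_i max(0, 1 - y_i(w.x_i + b))."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0       # gradient of 1/2 ||w||^2 is w itself
        for x, y in zip(X, Y):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                # Inside the margin: the hinge term contributes -C * y * x
                for i in range(d):
                    gw[i] -= C * y * x[i]
                gb -= C * y
        w = [wi - rate * gi for wi, gi in zip(w, gw)]
        b -= rate * gb
    return w, b

# The same toy points as in the perceptron example; positive iff x2 > x1.
X = [[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]]
Y = [1, 1, -1, -1]
w, b = train_svm(X, Y)
```

Increasing `C` penalizes margin violations more heavily, approaching the hard-margin behaviour exactly as described above.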

Kernel-induced feature spaces

As presented so far the SVM algorithm finds decision boundaries only for linearly

separable data. This limitation can be removed by exploiting the “kernel trick”. The

kernel technique depends on the fact that for some linear classification algorithms,

including SVM, there exist dual formulations, where the weight vector can be expressed

as a linear combination of training examples, and the algorithm only involves computing

dot products between the test instance and training instances.

Dual form

The dual formulation of the optimal hyperplane for SVM is in terms of

support vectors, where SV is the set of indices of support vectors:

$$f(x, \alpha^*, b^*) = \mathrm{sign}\left(\sum_{i \in SV} y_i \alpha_i^* \left(\Phi(x_i) \cdot \Phi(x)\right) + b^*\right) \quad (3.40)$$

The weights in this decision function are the Lagrange multipliers α∗. Points which

are not in the support vector set have no influence on the final decision. The dual
