International Journal of Computer Processing of Oriental Languages
Vol. 20, Nos. 2 & 3 (2007) 1–16
© Chinese Language Computer Society &
World Scientific Publishing Co.
Building a Collocation Net^a
ZHOU GUODONG*,†, ZHANG MIN†, LI JUNHUI* AND ZHU QIAOMING*
*School of Computer Science and Technology,
Soochow University, 1 Shizi Street, Suzhou, China 215006
{gdzhou, jhli, qmzhu}@suda.edu.cn
†Institute for Infocomm Research,
21 Heng Mui Keng Terrace, Singapore 119613
{zhougd, mzhang}@i2r.a-star.edu.sg
This paper presents an approach for building a novel two-level collocation net
from a large raw corpus; the net enables calculation of the collocation
relationship between any two words. The first level consists of atomic classes
(each atomic class consists of one word and feature bigram), which are clustered
into the second-level class set. Each class in both levels is represented by its
collocation candidate distribution, extracted from the linguistic analysis of the
raw training corpus, over possible collocation relation types. In this way, all the
information extracted from the linguistic analysis is kept in the collocation net.
Our approach applies to both frequently and less frequently occurring words by
providing a clustering mechanism, and it resolves the data sparseness problem
through the collocation net. Experimentation shows that the collocation net is
efficient and effective in solving the data sparseness problem and in determining
the collocation relationship between any two words.
Keywords: Collocation net; Data sparseness problem; Clustering.
1. Introduction
In any natural language, there always exist many highly associated relationships
between words. The two words “strong” and “powerful” are perhaps the canonical
example. Although “strong” and “powerful” have similar syntax and semantics,
there exist contexts where one is much more appropriate than the other [6].
^a Part of the work was done when the author was at the Institute for Infocomm Research,
Singapore.
For example, we always say “strong tea” instead of “strong computer” and
“powerful computer” instead of “powerful tea”. Psychological experiments
[11] also indicated that human reactions to a highly associated word pair were
stronger and faster than those to a poorly associated one. Lexicographers use
the terms “collocation” and “co-occurrence” to describe various constraints on
pairs of words. Here, we use “collocation” in the narrower sense of a relationship
between grammatically bound words, e.g. “strong” and “tea”, which occur in a
particular grammatical order, and “co-occurrence” for the more general phenomenon
of relationships between words, e.g. “doctor” and “nurse”, which are likely to be
used in the same context [10].
This paper will concentrate on “collocation” rather than “co-occurrence”
although there is much overlap between these two terms. There is also considerable
overlap between the concept of “collocation” and notions like “term”, “technical
term” and “terminological phrase”. The latter three are commonly used when
collocations are extracted from technical domains. However, it should be noted
that the word “term” has a different meaning in information retrieval, where it
refers to words and phrases.
There is growing interest in collocations and co-occurrences, partly
because this area has been undervalued in the structural linguistic traditions that
follow Saussure and Chomsky. Structural linguistics concentrates on general
abstractions about the properties of phrases and sentences. In contrast, the
Contextual Theory of Meaning, which follows Firth, Halliday and Sinclair, emphasizes
the importance of context: the context of the social setting, the context of
discourse and the context of surrounding words. Such detailed contextual information
easily gets lost in structural linguistics. This paper follows Firth’s Contextual
Theory of Meaning to discover collocations that are grammatically bound.
Collocations are important for a number of applications: natural language
generation, computational lexicography, parsing, proper noun discovery, corpus
linguistic research, machine translation, information retrieval, etc. As an example,
[7] showed how syntactically related collocation statistics can be used to improve
the performance of a parser on sentences such as “She wanted/placed/put the dress
on the rack.”, where lexical preferences are crucial to resolving the ambiguity of
prepositional phrase attachment. It also showed that a parser can enforce these
preferences by comparing the statistical association of the syntactic relation
verb-preposition (“want…on”) with the statistical association of the syntactic
relation object-preposition (“dress…on”) when attaching the prepositional phrase.
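The comparison of the two associations can be illustrated with a minimal sketch. It is not the method of [7]: the scoring function, the smoothing constant and all counts below are assumptions made up for illustration. The sketch compares the smoothed probability of seeing the preposition after the verb with the probability of seeing it after the object noun.

```python
import math


def attachment_score(verb_prep, verb_total, noun_prep, noun_total, smoothing=0.5):
    """Log2 ratio of P(prep | verb) to P(prep | noun).

    A positive score favours attaching the prepositional phrase to the verb,
    a negative score favours attaching it to the object noun.
    The additive smoothing is an assumption of this sketch.
    """
    p_verb = (verb_prep + smoothing) / (verb_total + 2 * smoothing)
    p_noun = (noun_prep + smoothing) / (noun_total + 2 * smoothing)
    return math.log2(p_verb / p_noun)


# Toy counts, invented for illustration only (not taken from any corpus).
# "put ... on" is frequent, "want ... on" is rare, "dress ... on" is rare.
score_put = attachment_score(verb_prep=60, verb_total=100, noun_prep=5, noun_total=100)
score_want = attachment_score(verb_prep=2, verb_total=100, noun_prep=5, noun_total=100)

print("put:", "verb" if score_put > 0 else "noun")
print("want:", "verb" if score_want > 0 else "noun")
```

With these toy counts the phrase attaches to the verb for “put” and to the object for “wanted”, which is the kind of lexical preference the example sentence is meant to show.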
Currently, there are two categories of approaches used to discover collocations
and co-occurrences: statistics-based and parsing-based.
1.1. Statistics-based methods
The methods in this category are normally used to extract word co-occurrence
relationships, i.e. the phenomenon where words are likely to occur in the same
context, from a raw unparsed corpus. Different criteria are used to determine
word co-occurrences. First, the frequency-based method [12, 8, 18] uses the
frequencies of word pairs, optionally with the help of a part-of-speech filter, a
stop word list and/or acceptable patterns. Second, the mean and variance-based
method [14] computes the mean and variance of the offsets between the words in the
corpus; word pairs with low variances are regarded as word co-occurrences.
Third, hypothesis testing-based methods determine whether two words occur in the
same context more often than by chance. For example, the t-test [1, 3] assumes a
normal distribution and looks at the difference between the observed and expected
means, scaled by the variance of the sample data. The chi-square test [2, 15] uses
an n-by-n table to show the dependence of occurrences between words and compares
the observed frequencies in the table with the frequencies expected under
independence. The likelihood ratio [5] assumes a binomial distribution and tells
how much more likely the independence hypothesis is than the dependence hypothesis.
Fourth, the mutual information-based method [13, 17, 19, 20] measures the change
of information when two words co-occur.
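Two of these criteria can be sketched from raw counts. The following is a minimal illustration using the standard textbook forms of pointwise mutual information and the t-score, not the exact formulations of the cited works; the function names and the toy counts are assumptions of this sketch.

```python
import math


def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (in bits) of a word pair.

    c_xy: count of the pair, c_x and c_y: counts of each word,
    n: total number of pair positions in the corpus.
    """
    return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))


def t_score(c_xy, c_x, c_y, n):
    """t-score: (observed mean - expected mean) / sqrt(variance / n).

    The sample variance is approximated by the observed mean, which is the
    usual approximation when the pair probability is small.
    """
    observed = c_xy / n
    expected = (c_x / n) * (c_y / n)
    return (observed - expected) / math.sqrt(observed / n)


# Toy counts, invented for illustration only.
associated = (8, 16, 16, 1024)    # the pair occurs far more often than chance
independent = (1, 32, 32, 1024)   # the pair occurs exactly as often as chance

print(pmi(*associated), t_score(*associated))
print(pmi(*independent), t_score(*independent))
```

Both scores are zero when the pair occurs exactly as often as independence predicts and grow as the pair becomes more strongly associated. Both are undefined for a pair that never occurs, which is one face of the sparseness problem discussed next.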
However, there exist several problems with the statistics-based methods:
• These methods cannot differentiate between different types of linguistic relations,
and the extracted co-occurrences may not be grammatically bound.
• These methods are effective only on frequently occurring words and not on less
frequently occurring words, because they provide no mechanism for
categorization to resolve the problem of data sparseness.
• The co-occurrences extracted by the frequency-based method and the
variance-based method may be negative. The scores used by the t-test and the
chi-square test are difficult to interpret, while the mutual information-based
method performs worst on low-frequency words.
• The extracted co-occurrences are always stored in a dictionary, which
contains only a limited number of entries and very limited information for
each one.
1.2. Parsing-based methods
The parsing-based methods rely on syntactic analysis. These methods
can extract linguistically related word collocations from the parse trees and can
differentiate between different types of linguistic relations. Normally these
methods are combined with the frequency-based method to reject collocations whose
frequencies are below a predefined threshold [16].
However, there also exist several problems with the parsing-based
methods:
• These methods apply only to frequently occurring words and are not effective
on less frequently occurring words, because they provide no mechanism for
categorization to resolve the problem of data sparseness.
• Manual intervention may be required to ensure that the extracted collocations
are valid, especially when other statistics-based methods, e.g. the
frequency-based method, are not used to reject the invalid ones.
• Similar to the co-occurrences extracted using the statistics-based methods, the
collocations extracted using the parsing-based methods may be negative and
are always stored in a dictionary, which contains only a limited number of
entries and very limited information for each one.
Generally, both the statistics-based and parsing-based approaches are effective
only on frequently occurring words and not on less frequently occurring
words, due to the data sparseness problem. Moreover, the extracted collocations
or co-occurrences are always stored in a dictionary, which contains only a limited
number of entries with limited information for each one. Finally, the collocation
dictionary normally does not differentiate the strength of various collocations.
This paper combines the parsing-based approach and the statistics-based
approach, and proposes a novel structure, the collocation net. The collocation
net resolves the data sparseness problem by providing a clustering mechanism,
and the collocation relationship between any two words can be easily determined
and measured from it. Here, the collocation relationship is calculated using two
novel measures: estimated pair-wise mutual information (EPMI) and estimated
average mutual information (EAMI). Moreover, all the
information extracted from the linguistic analysis is kept in the collocation net.
Compared with the traditional collocation dictionary, the collocation net provides
a much more powerful facility since it can determine and measure the collocation
relationship between any two words quantitatively.
The layout of this paper is as follows: Section 2 describes the novel structure
of the collocation net. Section 3 describes estimated pair-wise mutual information
(EPMI) and estimated average mutual information (EAMI), which determine and
measure the collocation relationship between any two words, while Section 4
presents a method for automatically building a collocation net given a large
raw corpus. Experimentation is given in Section 5. Finally, some conclusions
are drawn in Section 6.
2. Collocation Net
The collocation net is a two-level structure that stores rich information
about the collocation candidates, along with other information extracted from the
linguistic analysis of a large raw corpus. The first level consists of word and
feature bigrams^b, while the second level consists of classes that are clustered
from the word and feature bigrams in the first level. For convenience, each word
and feature bigram in the first level is also regarded as a class (atomic class).
That is to say, each first-level atomic class contains only one bigram, while each
second-level class contains one or more word and feature bigrams clustered from
first-level atomic classes.
Meanwhile, each class in both levels of the collocation net is represented
by its related collocation candidate distribution, extracted from the linguistic
analysis. In this paper, a collocation candidate is represented as a 3-tuple: a left
side, a right side and a collocation relation type, which represents the collocation
relationship between the left side and the right side. Both the left and right
sides can be either a word and feature bigram or a class of word and feature
bigrams. For example, a collocation candidate can be either $wf_i - CR_k - wf_j$ or
$C_{hi} - CR_k - C_{gj}$, where $wf_i$ is a word and feature bigram, $C_{hi}$ is the
ith class in the hth level and $CR_k$ is a relation type.
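The representation described above can be sketched as a small data structure. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the type names (`WordFeature`, `Candidate`, `CollocationClass`), the relation label "ADJ-NOUN", the part-of-speech tags and the counts are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import NamedTuple


class WordFeature(NamedTuple):
    """A word and feature bigram; the feature here is a part-of-speech tag."""
    word: str
    feature: str


class Candidate(NamedTuple):
    """A collocation candidate: left side, relation type, right side."""
    left: object       # a WordFeature or a class of WordFeatures
    relation: str      # collocation relation type (hypothetical label)
    right: object


@dataclass
class CollocationClass:
    """A class in either level, with its collocation candidate frequencies."""
    level: int                                   # 1 = atomic class, 2 = clustered class
    members: frozenset                           # the word and feature bigrams it contains
    candidates: Counter = field(default_factory=Counter)

    def observe(self, candidate, count=1):
        """Record that a candidate was seen `count` times."""
        self.candidates[candidate] += count


# Toy data, invented for illustration only.
strong = WordFeature("strong", "JJ")
powerful = WordFeature("powerful", "JJ")
tea = WordFeature("tea", "NN")
computer = WordFeature("computer", "NN")

cand = Candidate(strong, "ADJ-NOUN", tea)
strong_cls = CollocationClass(level=1, members=frozenset({strong}))
strong_cls.observe(cand, 3)

powerful_cls = CollocationClass(level=1, members=frozenset({powerful}))
powerful_cls.observe(Candidate(powerful, "ADJ-NOUN", computer), 2)

# A second-level class pools the members and the distributions of its atomic classes.
merged = CollocationClass(
    level=2,
    members=strong_cls.members | powerful_cls.members,
    candidates=strong_cls.candidates + powerful_cls.candidates,
)
print(merged.candidates.most_common())
```

Pooling the candidate counts of several atomic classes is what lets a rare word borrow evidence from the more frequent words clustered with it. How the classes are actually clustered is left to the building method and is not modelled here.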
Briefly, the collocation net is defined as follows:
$$\mathrm{CoNET} = \{wf, CR, L_1, L_2, P_{h \to g}\} \qquad (1)$$
• wf stores possible word and feature bigrams extracted from the syntactic
analysis: $wf = \{wf_i, 1 \le i \le |wf|\}$, where $wf_i$ is the ith word and
feature pair in wf and $|wf|$ is the number of word and feature pairs in wf.
• CR stores possible collocation relation types returned from the syntactic
analysis: $CR = \{CR_i, 1 \le i \le |CR|\}$, where $CR_i$ is the ith linguistic
relation in CR and $|CR|$ is the number of collocation relation types in CR.
• L1 and L2 are the first and second levels in the collocation net,
respectively:
$$L_h = \{\langle C_{hi}, FD_{C_{hi}} \rangle\} \qquad (2)$$
where $C_{hi}$ is the ith class in $C_h$; $C_h = \{C_{hi}, 1 \le i \le |C_h|\}$
is the class set in $L_h$ (obviously, $C_1 = wf$); $|C_h|$ is the number of
classes in $C_h$ and $FD_{C_{hi}}$ $(1 \le i \le |C_h|)$ is the frequency
distribution of collocation candidates
^b The word and feature bigram is used to distinguish occurrences of the same word with
different features, which can be “word sense”, “part-of-speech”, etc. In this paper,
“part-of-speech” is used as the feature.