International Journal on Artificial Intelligence Tools
Vol. 13, No. 1 (2004) 157–169
© World Scientific Publishing Company
KEYWORD EXTRACTION FROM A SINGLE DOCUMENT USING
WORD CO-OCCURRENCE STATISTICAL INFORMATION
Y. MATSUO
National Institute of Advanced Industrial Science and Technology
y.matsuo@aist.go.jp
M. ISHIZUKA
University of Tokyo
ishizuka@miv.t.u-tokyo.ac.jp
Received 18 July 2003
Revised 19 October 2003
Accepted 19 October 2003
We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first; then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution shows the importance of a term in the document as follows: if the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword. The degree of bias of the distribution is measured by the χ²-measure. Our algorithm shows comparable performance to tfidf without using a corpus.
Keywords: Keyword extraction; co-occurrence; χ²-measure.
1. Introduction
Keyword extraction is an important technique for document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. By extracting appropriate keywords, we can easily choose which document to read and learn the relationships among documents. A popular algorithm for indexing is the tfidf measure, which extracts keywords that appear frequently in a document but do not appear frequently in the remainder of the corpus. The term "keyword extraction" is used in the context of text mining, for example [15]. A comparable research topic is called "automatic term recognition" in the context of computational linguistics and "automatic indexing" or "automatic keyword extraction" in information retrieval research.
Recently, numerous documents have been made available electronically. Domain-
independent keyword extraction, which does not require a large corpus, has many
applications. For example, if one encounters a new Web page, one might like to know
the contents quickly by some means, e.g., by having the keywords highlighted. If
one wants to know the main assertion of a paper, one would want to have some
keywords. In these cases, keyword extraction without a corpus of the same kind
of documents is very useful. Word count [8] is sometimes sufficient for document
overview; however, a more powerful tool is desirable.
This paper explains a keyword extraction algorithm based solely on a single
document. First, frequent terms are extracted. Co-occurrences of a term and fre-
quent terms are counted. If a term appears frequently with a particular subset of
terms, the term is likely to have important meaning. The degree of bias of the co-
occurrence distribution is measured by the χ²-measure. We show that our keyword
extraction performs well without the need for a corpus. In this paper, a term is
defined as a word or a word sequence. We do not intend to limit the meaning in a
terminological sense. A word sequence is referred to as a phrase.
This paper is organized as follows. The next section describes our idea of key-
word extraction. We describe the algorithm in detail followed by evaluation and
discussion. Finally, we summarize our contributions.
2. Term Co-occurrence and Importance
A document consists of sentences. In this paper, a sentence is considered to be a set
of words separated by a stop mark (“.”, “?” or “!”). We also include document titles,
section titles, and captions as sentences. Two terms in a sentence are considered to
co-occur once. That is, we see each sentence as a “basket,” ignoring term order and
grammatical information except when extracting word sequences.
We can obtain frequent terms by counting term frequencies. Let us take a very
famous paper by Alan Turing [20] as an example. Table 1 shows the top ten frequent
terms and their probabilities of occurrence, normalized so that the sum is 1
(i.e., normalized relative frequency). Next, a co-occurrence matrix is obtained by
counting frequencies of pairwise term co-occurrences, as shown in Table 2. For
example, term a and term b co-occur in 30 sentences in the document. Let N denote
the number of different terms in the document. While the term co-occurrence matrix
is an N × N symmetric matrix, Table 2 shows only a part of the whole – an N × 10
matrix. We do not define diagonal components.
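As an illustration of this construction, the following sketch (not the authors' code; the helper name and the plain whitespace tokenizer are our own simplifications, and the stemming and stop-word removal of Section 3.3 are omitted) splits a text on the stop marks, treats each sentence as a bag of terms, and counts co-occurrences with the top frequent terms:

```python
import re
from collections import Counter

def build_cooccurrence(text, num_frequent=10):
    """Treat each sentence as a "basket" of terms and count, for every term,
    the number of sentences it shares with each of the top frequent terms."""
    # Split into sentences on the stop marks ".", "?" and "!".
    sentences = [s.split() for s in re.split(r"[.?!]", text.lower()) if s.strip()]

    # Term frequencies over the whole document.
    term_freq = Counter(t for s in sentences for t in s)
    frequent = [t for t, _ in term_freq.most_common(num_frequent)]
    frequent_set = set(frequent)

    # cooc[w][g] = number of sentences in which w and g both appear.
    cooc = {}
    for s in sentences:
        terms = set(s)
        for w in terms:
            row = cooc.setdefault(w, Counter())
            for g in terms & frequent_set:
                if g != w:                 # diagonal components are not defined
                    row[g] += 1
    return frequent, cooc, term_freq
```

Applied to a document such as Turing's paper, `frequent` would roughly correspond to the terms of Table 1 and the rows of `cooc` to the rows of Table 2.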
Assuming that term w appears independently of the frequent terms (denoted as
G), the distribution of co-occurrence of term w and the frequent terms is similar to
the unconditional distribution of occurrence of the frequent terms shown in Table 1.
Table 1. Frequency and probability distribution.
Frequent term a b c d e f g h i j Total
Frequency 203 63 44 44 39 36 35 33 30 28 555
Probability 0.366 0.114 0.079 0.079 0.070 0.065 0.063 0.059 0.054 0.050 1.0
a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number
Table 2. A co-occurrence matrix.
        a    b    c    d    e    f    g    h    i    j   Total
a       –   30   26   19   18   12   12   17   22    9     165
b      30    –    5   50    6   11    1    3    2    3     111
c      26    5    –    4   23    7    0    2    0    0      67
d      19   50    4    –    3    7    1    1    0    4      89
e      18    6   23    3    –    7    1    2    1    0      61
f      12   11    7    7    7    –    2    4    0    0      50
g      12    1    0    1    1    2    –    5    1    0      23
h      17    3    2    1    2    4    5    –    0    0      34
i      22    2    0    0    1    0    1    0    –    7      33
j       9    3    0    4    0    0    0    0    7    –      23
...
u       6    5    5    3    3   18    2    2    1    0      45
v      13   40    4   35    3    6    1    0    0    2     104
w      11    2    2    1    1    0    1    4    0    0      22
x      17    3    2    1    2    4    5    0    0    0      34
u: imitation, v: digital computer, w: kind, x: make
Fig. 1. Co-occurrence probability distribution of the terms “kind”, “make”, and frequent terms.
Conversely, if term w has a semantic relation with a particular set of terms g ∈ G, the co-occurrence of term w and g is greater than expected; in this case, the distribution is said to be biased.
Figures 1 and 2 show the co-occurrence probability distribution of some terms
and frequent terms. In the figures, unconditional distribution of frequent terms is
shown as “unconditional”. A general term such as “kind” or “make” is used relatively
impartially with each frequent term, while a term such as “imitation” or “digital
computer” shows co-occurrence with particular terms. These biases are derived
from either semantic, lexical, or other relationships between two terms. Thus, a
Fig. 2. Co-occurrence probability distribution of the terms “imitation”, “digital computer”, and
frequent terms.
term with co-occurrence biases may have an important meaning in a document. In
this example, “imitation” and “digital computer” are important terms, as we all
know: in his paper, Turing proposed an “imitation game” to replace the question
“Can machines think?”
Therefore, the degree of bias of co-occurrence can be used as an indicator of term importance. However, if the term frequency is small, the degree of bias is not reliable. For example, assume that term w_1 appears only once and co-occurs only with term a once (probability 1.0). At the other extreme, assume that term w_2 appears 100 times and co-occurs only with term a 100 times (again with probability 1.0). Intuitively, w_2 seems more reliably biased. In order to evaluate the statistical significance of biases, we use the χ² test, which is very common for evaluating biases between expected and observed frequencies. For each term, the frequency of co-occurrence with the frequent terms is regarded as a sample value; the null hypothesis is that “occurrence of the frequent terms G is independent of occurrence of term w,” which we expect to reject.
We denote the unconditional probability of a frequent term g ∈ G as the expected probability p_g, and the total number of co-occurrences of term w and the frequent terms G as n_w. The frequency of co-occurrence of term w and term g is written as freq(w, g). The statistical value of χ² is defined as

    \chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}.    (1)

If χ²(w) > χ²_α, the null hypothesis is rejected with significance level α. The term n_w p_g represents the expected frequency of co-occurrence; and (freq(w, g) − n_w p_g)
Table 3. Terms with high χ² value.

Rank    χ²     Term               Frequency
1      593.7   digital computer       31
2      179.3   imitation game         16
3      163.1   future                  4
4      161.3   question               44
5      152.8   internal                3
6      143.5   answer                 39
7      142.8   input signal            3
8      137.7   moment                  2
9      130.7   play                    8
10     123.0   output                 15
...
553      0.8   Mr.                     2
554      0.8   sympathetic             2
555      0.7   leg                     2
556      0.7   chess                   2
557      0.6   Pickwick                2
558      0.6   scan                    2
559      0.3   worse                   2
560      0.1   eye                     2

(We set the top ten frequent terms as G.)
represents the difference between observed and expected frequencies. Therefore, a large χ²(w) indicates that the co-occurrence of term w shows strong bias. In this paper, we use the χ²-measure as an index of bias, not for tests of hypotheses.
Table 3 shows terms with high χ² values and ones with low χ² values in Turing’s paper. Generally, terms with a large χ² value are relatively important in the document; terms with a small χ² value are relatively trivial. The table excludes terms whose frequency is less than two. However, we do not have to define such a threshold, because low frequency usually indicates a low χ² value (unless n_w p_g is very large, which is quite unusual).
In summary, our algorithm first extracts frequent terms as a “standard”; then
it extracts terms with high deviation from the standard as keywords.
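As a minimal sketch of Eq. (1) (illustrative only; variable names are ours), the χ² value of a term can be computed from its co-occurrence counts with the frequent terms and their unconditional probabilities p_g from Table 1:

```python
def chi_square(cooc_row, p_g):
    """Eq. (1): chi^2(w) from freq(w, g) (cooc_row) and the unconditional
    probabilities p_g of the frequent terms."""
    n_w = sum(cooc_row.values())      # total co-occurrences of w with G
    score = 0.0
    for g, p in p_g.items():
        expected = n_w * p            # expected co-occurrence frequency n_w * p_g
        if expected > 0:
            score += (cooc_row.get(g, 0) - expected) ** 2 / expected
    return score

# Ranking every term by its chi^2 value (cooc as built in the earlier sketch):
# scores = {w: chi_square(cooc[w], p_g) for w in cooc}
```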
3. Algorithm Description and Improvement
The previous section described the basic idea of our algorithm. This section
gives the precise algorithm description and two algorithm improvements: calculation
of the χ² value and clustering of terms. These improvements lead to better performance.
3.1. Calculation of χ² values
To improve the calculation of the χ² value, we focus on two aspects: variety of sentence length and robustness of the χ² value.
First, we consider the length of sentences. A document consists of sentences of
various lengths. If a term appears in a long sentence, it is likely to co-occur with
many terms; if a term appears in a short sentence, it is less likely to co-occur with
other terms. We consider the length of each sentence and revise our definitions. We
denote
  p_g as (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document),
  n_w as the total number of terms in sentences where w appears.
Again, n_w p_g represents the expected frequency of co-occurrence; however, its value becomes more precise. (Note that p_g is now the probability of a term in the document co-occurring with g. Each term can co-occur with multiple terms; therefore, the sum of p_g over all terms is not 1.0 but the average number of frequent terms in a sentence.)
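Under these revised definitions, p_g and n_w can be computed in a single pass over the sentences; a sketch under the same assumptions as the earlier one (tokenized sentences, hypothetical function name):

```python
def length_aware_expectations(sentences, frequent):
    """Revised definitions: p_g = (total number of terms in sentences where g
    appears) / (total number of terms in the document); n_w = total number of
    terms in sentences where w appears."""
    total_terms = sum(len(s) for s in sentences)

    p_g = {g: 0 for g in frequent}
    n_w = {}
    for s in sentences:
        terms = set(s)
        for g in terms:
            if g in p_g:
                p_g[g] += len(s)
        for w in terms:
            n_w[w] = n_w.get(w, 0) + len(s)

    p_g = {g: count / total_terms for g, count in p_g.items()}
    return p_g, n_w
```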
Second, we consider the robustness of the χ² value. A term co-occurring with a particular term g ∈ G has a high χ² value. However, such terms are sometimes adjuncts of term g and not important terms. For example, in Table 3, the terms “future” and “internal” co-occur selectively with the frequent term “state,” because they are used in the forms “future state” and “internal state.” Though the χ² values for these terms are high, “future” and “internal” themselves are not important. If “state” were not a frequent term, the χ² values of these terms would diminish rapidly.
We use the following function to measure robustness of bias values; it subtracts the maximal term from the χ² value,

    \chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}.    (2)
Using this function, we can estimate χ′²(w) as low if w co-occurs selectively with only one term. It will have a high value if w co-occurs selectively with more than one term.
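Eq. (2) then only subtracts the largest single contribution from the sum; a minimal sketch (again not the authors' implementation), taking the revised n_w and p_g as inputs:

```python
def robust_chi_square(cooc_row, n_w, p_g):
    """Eq. (2): chi'^2(w) = chi^2(w) minus the largest single contribution,
    so a term that co-occurs selectively with only one frequent term scores low."""
    contributions = []
    for g, p in p_g.items():
        expected = n_w * p
        if expected > 0:
            contributions.append((cooc_row.get(g, 0) - expected) ** 2 / expected)
    if not contributions:
        return 0.0
    return sum(contributions) - max(contributions)
```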
3.2. Clustering of terms
Some terms co-occur with each other and clusters of terms are obtained by combin-
ing co-occurring terms. Below we show how to calculate the χ² value more reliably by clustering terms.
A co-occurrence matrix is originally an N × N matrix, where columns corre-
sponding to frequent terms are extracted for calculation. We ignore the remaining
columns, i.e., co-occurrence with low frequency terms, because it is difficult to es-
timate precise probability of occurrence for low frequency terms.
To improve extracted keyword quality, it is very important to select the proper
set of columns from a co-occurrence matrix. The set of columns is preferably or-
thogonal; assuming that terms g_1 and g_2 appear together very often, co-occurrence
Table 4. Two transposed columns.
        a    b    c    d    e    f    g    h    i    j   ...
c      26    5    –    4   23    7    0    2    0    0   ...
e      18    6   23    3    –    7    1    2    1    0   ...
Table 5. Clustering of the top 49 frequent terms.
C1: game, imitation, imitation game, play, programme
C2: system, rules, result, important
C3: computer, digital, digital computer
C4: behaviour, random, law
C5: capacity, storage
C6: question, answer
· · · · · ·
C26: human
C27: state
C28: learn
of terms w and g_1 might imply the co-occurrence of w and g_2. Thus, term w will have a high χ² value; this is very problematic. It is straightforward to extract an orthogonal set of columns; however, to prevent the matrix from becoming too sparse, we will cluster terms (i.e., columns).
Many studies address term clustering. Two major approaches [6] are:
Similarity-based clustering: If terms w_1 and w_2 have similar distributions of co-occurrence with other terms, w_1 and w_2 are considered to be in the same cluster.

Pairwise clustering: If terms w_1 and w_2 co-occur frequently, w_1 and w_2 are considered to be in the same cluster.
Table 4 shows an example of two (transposed) columns extracted from a co-
occurrence matrix. Similarity-based clustering considers whether the two rows have similar co-occurrence distributions, whereas pairwise clustering focuses on how frequently the two terms co-occur with each other (here, freq(c, e) = 23).
By similarity-based clustering, terms with the same role, e.g., “Monday,” “Tuesday,” ..., or “build,” “establish,” and “found,” are clustered [13]. In our preliminary experiments, when applied to a single document, similarity-based clustering groups paraphrases, or a phrase and its component (e.g., “digital computer” and “computer”). The similarity of two distributions is measured statistically by the Kullback-Leibler divergence or the Jensen-Shannon divergence [2].
On the other hand, pairwise clustering yields relevant terms in the same cluster: “doctor,” “nurse,” and “hospital” [19]. The frequency of co-occurrence or mutual information can be used to measure the degree of relevance [1,3].
Our algorithm uses both types of clustering. First we cluster terms by a simi-
larity measure (using Jensen-Shannon divergence); subsequently, we apply pairwise
clustering (using mutual information). Table 5 shows an example of term cluster-
ing. Proper clustering of frequent terms results in an appropriate χ² value for each term. For simplicity, we do not take cluster size into account; balancing
the clusters may improve the algorithm performance.
Below, co-occurrence of a term and a cluster implies co-occurrence of the term
and any term in the cluster.
3.3. Algorithm
The algorithm is as follows. Thresholds are determined by preliminary experiments.
1. Preprocessing: Stem words by the Porter algorithm [14] and extract phrases based on the Apriori algorithm [5]. We extract phrases of up to 4 words that appear more than 3 times. Discard stop words included in the stop list used in the SMART system [16].
2. Selection of frequent terms: Select the top frequent terms up to 30% of the number of running terms, N_total.
3. Clustering frequent terms: Cluster a pair of terms whose Jensen-Shannon divergence is above the threshold (0.95 × log 2). The Jensen-Shannon divergence is defined as

    J(w_1, w_2) = \log 2 + \frac{1}{2} \sum_{w' \in C} \big[\, h(P(w'|w_1) + P(w'|w_2)) - h(P(w'|w_1)) - h(P(w'|w_2)) \,\big],

where h(x) = −x log x and P(w'|w_1) = freq(w', w_1)/freq(w_1). Cluster a pair of terms whose mutual information is above the threshold (log(2.0)). The mutual information between w_1 and w_2 is defined as

    M(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\, P(w_2)} = \log \frac{N_{\mathrm{total}}\; \mathrm{freq}(w_1, w_2)}{\mathrm{freq}(w_1)\, \mathrm{freq}(w_2)}.

Two terms are in the same cluster if they are clustered by either of the two clustering criteria (a code sketch of this step is given after the algorithm listing). The obtained clusters are denoted as C.
4. Calculation of expected probability: Count the number of terms co-occurring with c ∈ C, denoted as n_c, to yield the expected probability p_c = n_c / N_total.
5. Calculation of χ′² value: For each term w, count its co-occurrence frequency with c ∈ C, denoted as freq(w, c). Count the total number of terms in the sentences including w, denoted as n_w. Calculate the χ′² value as

    \chi'^2(w) = \sum_{c \in C} \frac{(\mathrm{freq}(w, c) - n_w p_c)^2}{n_w p_c} \;-\; \max_{c \in C} \frac{(\mathrm{freq}(w, c) - n_w p_c)^2}{n_w p_c}.
6. Output keywords: Show a given number of terms having the largest χ′² value.
In this paper, we use both nouns and verbs, because verbs or verb+noun combinations are sometimes important for illustrating the content of the document. Of course, the algorithm can also be applied to nouns only.
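The following sketch illustrates the clustering criteria of step 3 (again not the authors' implementation: it sums the Jensen-Shannon term over the frequent terms themselves rather than over clusters, merges pairs with a simple union-find, and reuses the counts produced by the earlier sketches):

```python
import math
from itertools import combinations

def h(x):
    # h(x) = -x log x, with h(0) = 0
    return -x * math.log(x) if x > 0 else 0.0

def cluster_frequent_terms(frequent, term_freq, cooc, n_total):
    """Step 3 sketch: merge frequent terms that satisfy either the
    Jensen-Shannon criterion (J > 0.95 * log 2) or the mutual-information
    criterion (M > log 2). term_freq[w] is freq(w), cooc[w1][w2] is
    freq(w1, w2), and n_total is the number of running terms."""
    parent = {g: g for g in frequent}          # union-find over frequent terms

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for w1, w2 in combinations(frequent, 2):
        # Jensen-Shannon-style comparison of the two co-occurrence distributions.
        j = math.log(2)
        for wp in frequent:
            p1 = cooc.get(w1, {}).get(wp, 0) / term_freq[w1]
            p2 = cooc.get(w2, {}).get(wp, 0) / term_freq[w2]
            j += 0.5 * (h(p1 + p2) - h(p1) - h(p2))

        # Pointwise mutual information of w1 and w2.
        f12 = cooc.get(w1, {}).get(w2, 0)
        m = (math.log(n_total * f12 / (term_freq[w1] * term_freq[w2]))
             if f12 > 0 else float("-inf"))

        if j > 0.95 * math.log(2) or m > math.log(2.0):
            union(w1, w2)

    clusters = {}
    for g in frequent:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())
```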
Table 6. Improved results of terms with high χ² value.

Rank    χ²     Term                     Frequency
1      380.4   digital computer             63
2      259.7   storage capacity             11
3      202.5   imitation game               16
4      174.4   machine                     203
5      132.2   human mind                    2
6       94.1   universality                  6
7       93.7   logic                        10
8       82.0   property                     11
9       77.1   mimic                         7
10      77.0   discrete-state machine       17
Table 6 shows the results for Turing’s paper. Important terms are extracted
regardless of their frequencies.
4. Evaluation
For information retrieval, index terms are evaluated by their retrieval performance,
namely recall and precision. However, we claim that our algorithm is useful when
a corpus is not available due to cost or time to collect documents, or in a situation
where document collection is infeasible.
Keywords are sometimes attached to a paper; however, they are not defined in
a consistent way. Therefore, we employ author-based evaluation. Twenty authors
of technical papers in artificial intelligence research have participated in the ex-
periment. For each author, we showed keywords extracted from his/her paper by
tf (term frequency), tfidf, KeyGraph, and our algorithm. (For tfidf, the corpus is 166 papers in JAIR (Journal of Artificial Intelligence Research), from Vol. 1 in 1993 to Vol. 14 in 2001; idf is defined by log(D/df(w)) + 1, where D is the number of all documents and df(w) is the number of documents including w.) KeyGraph [11] is a keyword extraction algorithm which, like ours, requires only a single document. It calculates term weights based on term co-occurrence information and was recently used to analyze a variety of data in the context of Chance Discovery [12].
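For reference, the idf weighting of the tfidf baseline, as defined above, amounts to the following (a sketch; the baseline's exact preprocessing is not reproduced here, and the function names are ours):

```python
import math

def idf(word, documents):
    """idf(w) = log(D / df(w)) + 1, where D is the number of documents and
    df(w) is the number of documents containing w."""
    D = len(documents)
    df = sum(1 for doc in documents if word in doc)
    return math.log(D / df) + 1 if df else 0.0

def tfidf(word, doc, documents):
    # term frequency within the document times inverse document frequency
    return doc.count(word) * idf(word, documents)
```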
All these methods use word stemming, elimination of stop words, and extraction of
phrases. Using each method we extracted, gathered, and shuffled the top 15 terms.
Then, the authors were asked to check terms which they thought were important
in the paper. Precision can be calculated by the ratio of the checked terms to 15
terms derived by each method. Furthermore, the authors were asked to select five (or
more) terms which they thought were indispensable for the paper. Coverage of each
method was calculated by taking the ratio of the indispensable terms included in the
15 terms to all the indispensable terms. It is desirable to have the indispensable term
list beforehand. However, it is very demanding for authors to provide a keyword
list without seeing a term list. In our experiment, we allowed authors to add any
Table 7. Precision and coverage for 20 technical papers.

                   tf    KeyGraph   ours   tfidf
Precision         0.53     0.42     0.51    0.55
Coverage          0.48     0.44     0.62    0.61
Frequency index   28.6     17.3     11.5    18.1
Table 8. Results with respect to phrases.

                        tf    KeyGraph   ours   tfidf
Ratio of phrases       0.11     0.14     0.33    0.33
Precision w/o phrases  0.42     0.36     0.42    0.45
Recall w/o phrases     0.39     0.36     0.46    0.54
terms in the paper to the indispensable term list (even if they were not derived by
any of the methods).
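Written out explicitly, the two evaluation measures are simple set ratios (variable names are ours; `extracted` is a method's top-15 list, `checked` the author-checked terms, and `indispensable` the author's indispensable terms):

```python
def precision(extracted, checked):
    """Ratio of author-checked terms among the 15 terms a method produced."""
    return len(set(extracted) & set(checked)) / len(extracted)

def coverage(extracted, indispensable):
    """Ratio of indispensable terms that appear among the method's 15 terms."""
    return len(set(extracted) & set(indispensable)) / len(indispensable)
```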
Results are shown in Table 7. For each method, precision was around 0.5. How-
ever, coverage using our method exceeds that of tf and KeyGraph and is comparable
to that of tfidf; both tf and tfidf selected terms which appeared frequently in the
document (although tfidf considers frequencies in other documents). On the other
hand, our method can extract keywords even if they do not appear frequently. The
frequency index in the table shows average frequency of the top 15 terms. Terms
extracted by tf appear about 28.6 times, on average, while terms by our method
appear only 11.5 times. Therefore, our method can detect “hidden” keywords. We
can use the χ² value as a priority criterion for keywords because precision of the
top 10 terms by our method is 0.52, that of the top 5 is 0.60, while that of the top 2
is as high as 0.72. Though our method detects keywords consisting of two or more
words well, it is still nearly comparable to tfidf if we discard such phrases, as shown
in Table 8.
Computational time of our method is shown in Figure 3. The system is imple-
mented in C++ on a Linux OS, Celeron 333MHz CPU machine. Computational
time increases approximately linearly with respect to the number of terms; the
process completes in a few seconds if the given number of terms is less than
20,000.
5. Discussion and Related Work
Co-occurrence has attracted interest for a long time in computational linguistics. For example, co-occurrence in particular syntactic contexts is used for term clustering [13]. Co-occurrence information is also useful for machine translation: for example, Tanaka and Iwasaki use co-occurrence matrices of two languages to translate an ambiguous term [19]. Co-occurrence is also used for query expansion in information retrieval [17].

Weighting a term by occurrence dates back to the 1950s in the study by Luhn [8].
Fig. 3. Number of total terms and computational time.
More elaborate measures of term occurrence have been developed [18,10], essentially by counting term frequencies. Kageura and Umino summarized five groups of weighting measures [7]:
(i) a word which appears in a document is likely to be an index term;
(ii) a word which appears frequently in a document is likely to be an index term;
(iii) a word which appears only in a limited number of documents is likely to be
an index term for these documents;
(iv) a word which appears relatively more frequently in a document than in the
whole database is likely to be an index term for that document;
(v) a word which shows a specific distributional characteristic in the database is
likely to be an index term for the database.
Our algorithm corresponds to approach (v). Nagao used the χ² value to calculate
the weight of words [9], which also corresponds to approach (v). But our method
uses a co-occurrence matrix instead of a corpus, enabling keyword extraction using
only the document itself.
From a probabilistic point of view, a method for estimating the probability of previously unseen word combinations is important [2]. Several papers have addressed this
issue, but our algorithm uses co-occurrence with frequent terms, which alleviates
the estimation problem.
In the context of text mining, discovering keywords or keyword relationships is an important topic [4,15]. The general purpose of knowledge discovery is to extract
implicit, previously unknown, and potentially useful information from data. Our
algorithm can be considered a text mining tool in that it extracts important terms
even if they are rare.
6. Conclusion
In this paper, we developed an algorithm to extract keywords from a single docu-
ment. The main advantages of our method are its simplicity, requiring no corpus, and its performance, which is comparable to that of tfidf. As more electronic docu-
ments become available, we believe our method will be useful in many applications,
especially for domain-independent keyword extraction.
References
[1] K. W. Church and P. Hanks. Word association norms, mutual information, and lex-
icography. Computational Linguistics, 16(1):22, 1990.
[2] I. Dagan, L. Lee, and F. Pereira. Similarity-based models of word cooccurrence prob-
abilities. Machine Learning, 34(1):43, 1999.
[3] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Com-
putational Linguistics, 19(1):61, 1993.
[4] R. Feldman, M. Fresko, Y. Kinar, Y. Lindell, O. Liphstat, M. Rajman, Y. Schler,
and O. Zamir. Text mining at the term level. In Proceedings of the Second European
Symposium on Principles of Data Mining and Knowledge Discovery, page 65, 1998.
[5] Johannes Fürnkranz. A study using n-gram features for text categorization. Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence, 1998.
[6] Thomas Hofmann and Jan Puzicha. Statistical models for co-occurrence data. Tech-
nical Report AIM-1625, Massachusetts Institute of Technology, 1998.
[7] K. Kageura and B. Umino. Methods of automatic term recognition. Terminology,
3(2):259, 1996.
[8] H. P. Luhn. A statistical approach to mechanized encoding and searching of literary
information. IBM Journal of Research and Development, 1(4):390, 1957.
[9] M. Nagao, M. Mizutani, and H. Ikeda. An automated method of the extraction of
important words from Japanese scientific documents. Transactions of Information
Processing Society of Japan, 17(2):110, 1976.
[10] T. Noreault, M. McGill, and M. B. Koll. A Performance Evaluation of Similarity
Measure, Document Term Weighting Schemes and Representations in a Boolean En-
vironment. Butterworths, London, 1977.
[11] Y. Ohsawa, N. E. Benson, and M. Yachida. KeyGraph: Automatic indexing by co-
occurrence graph based on building construction metaphor. In Proceedings of the
Advanced Digital Library Conference, 1998.
[12] Yukio Ohsawa. Chance discoveries for making decisions in complex real world. New
Generation Computing, to appear.
[13] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
[14] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130, 1980.
[15] M. Rajman and R. Besancon. Text mining knowledge extraction from unstructured
textual data. In Proceedings of the 6th Conference of International Federation of
Classification Societies, 1998.
[16] G. Salton. Automatic Text Processing. Addison-Wesley, 1988.
[17] H. Schutze and J. O. Pederson. A cooccurrence-based thesaurus and two applications
to information retrieval. Information Processing and Management, 33(3):307–318,
1997.
[18] K. Sparck-Jones. A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28(5):111, 1972.
[19] K. Tanaka and H. Iwasaki. Extraction of lexical translations from non-aligned cor-
pora. In Proceedings of the 16th International Conference on Computational Linguis-
tics, page 580, 1996.
[20] A. M. Turing. Computing machinery and intelligence. Mind, 59:433, 1950.