International Journal on Artificial Intelligence Tools
Vol. 13, No. 1 (2004) 157–169
© World Scientific Publishing Company
KEYWORD EXTRACTION FROM A SINGLE DOCUMENT USING
WORD CO-OCCURRENCE STATISTICAL INFORMATION
Y. MATSUO
National Institute of Advanced Industrial Science and Technology
y.matsuo@aist.go.jp
M. ISHIZUKA
University of Tokyo
ishizuka@miv.t.u-tokyo.ac.jp
Received 18 July 2003
Revised 19 October 2003
Accepted 19 October 2003
We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first, then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution shows the importance of a term in the document as follows. If the probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword. The degree of bias of a distribution is measured by the χ²-measure. Our algorithm shows comparable performance to tfidf without using a corpus.

Keywords: Keyword extraction; co-occurrence; χ²-measure.
1. Introduction
Keyword extraction is an important technique for document retrieval, Web page retrieval, document clustering, summarization, text mining, and so on. By extracting appropriate keywords, we can easily choose which document to read to learn the relationship among documents. A popular algorithm for indexing is the tfidf measure, which extracts keywords that appear frequently in a document, but that don't appear frequently in the remainder of the corpus. The term "keyword extraction" is used in the context of text mining, for example [15]. A comparable research topic is called "automatic term recognition" in the context of computational linguistics and "automatic indexing" or "automatic keyword extraction" in information retrieval research.
Recently, numerous documents have been made available electronically. Domain-independent keyword extraction, which does not require a large corpus, has many applications. For example, if one encounters a new Web page, one might like to know
the contents quickly by some means, e.g., by having the keywords highlighted. If
one wants to know the main assertion of a paper, one would want to have some
keywords. In these cases, keyword extraction without a corpus of the same kind
of documents is very useful. Word count [8] is sometimes sufficient for document
overview; however, a more powerful tool is desirable.
This paper explains a keyword extraction algorithm based solely on a single
document. First, frequent terms are extracted. Co-occurrences of a term and fre-
quent terms are counted. If a term appears frequently with a particular subset of
terms, the term is likely to have important meaning. The degree of bias of the co-
occurrence distribution is measured by the χ²-measure. We show that our keyword
extraction performs well without the need for a corpus. In this paper, a term is
defined as a word or a word sequence. We do not intend to limit the meaning in a
terminological sense. A word sequence is written as a phrase.
This paper is organized as follows. The next section describes our idea of key-
word extraction. We describe the algorithm in detail followed by evaluation and
discussion. Finally, we summarize our contributions.
2. Term Co-occurrence and Importance
A document consists of sentences. In this paper, a sentence is considered to be a set
of words separated by a stop mark (“.”, “?” or “!”). We also include document titles,
section titles, and captions as sentences. Two terms in a sentence are considered to
co-occur once. That is, we see each sentence as a “basket,” ignoring term order and
grammatical information except when extracting word sequences.
We can obtain frequent terms by counting term frequencies. Let us take a very famous paper by Alan Turing [20] as an example. Table 1 shows the top ten frequent terms and the probability of occurrence, normalized so that the sum is 1
(i.e., normalized relative frequency). Next, a co-occurrence matrix is obtained by
counting frequencies of pairwise term co-occurrences, as shown in Table 2. For
example, term a and term b co-occur in 30 sentences in the document. Let N denote
the number of different terms in the document. While the term co-occurrence matrix
is an N × N symmetric matrix, Table 2 shows only a part of the whole – an N × 10
matrix. We do not define diagonal components.
Assuming that term w appears independently from frequent terms (denoted as
G), the distribution of co-occurrence of term w and the frequent terms is similar to
the unconditional distribution of occurrence of the frequent terms shown in Table 1.
Table 1. Frequency and probability distribution.

Frequent term   a      b      c      d      e      f      g      h      i      j      Total
Frequency       203    63     44     44     39     36     35     33     30     28     555
Probability     0.366  0.114  0.079  0.079  0.070  0.065  0.063  0.059  0.054  0.050  1.0

(a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number)
Table 2. A co-occurrence matrix.

      a    b    c    d    e    f    g    h    i    j    Total
a     –    30   26   19   18   12   12   17   22   9    165
b     30   –    5    50   6    11   1    3    2    3    111
c     26   5    –    4    23   7    0    2    0    0    67
d     19   50   4    –    3    7    1    1    0    4    89
e     18   6    23   3    –    7    1    2    1    0    61
f     12   11   7    7    7    –    2    4    0    0    50
g     12   1    0    1    1    2    –    5    1    0    23
h     17   3    2    1    2    4    5    –    0    0    34
i     22   2    0    0    1    0    1    0    –    7    33
j     9    3    0    4    0    0    0    0    7    –    23
...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
u     6    5    5    3    3    18   2    2    1    0    45
v     13   40   4    35   3    6    1    0    0    2    104
w     11   2    2    1    1    0    1    4    0    0    22
x     17   3    2    1    2    4    5    0    0    0    34

(u: imitation, v: digital computer, w: kind, x: make)
Fig. 1. Co-occurrence probability distribution of the terms “kind”, “make”, and frequent terms.
Conversely, if term w has a semantic relation with a particular set of terms g ∈ G, the co-occurrence of term w and g is greater than expected, and the distribution is said to be biased.
Figures 1 and 2 show the co-occurrence probability distribution of some terms and the frequent terms. In the figures, the unconditional distribution of frequent terms is shown as "unconditional". A general term such as "kind" or "make" is used relatively impartially with each frequent term, while a term such as "imitation" or "digital computer" shows co-occurrence with particular terms. These biases are derived from semantic, lexical, or other relationships between two terms.
Fig. 2. Co-occurrence probability distribution of the terms "imitation", "digital computer", and frequent terms.
Thus, a term with co-occurrence biases may have an important meaning in a document. In this example, "imitation" and "digital computer" are important terms, as we all know: in this paper, Turing proposed an "imitation game" to replace the question "Can machines think?"
Therefore, the degree of bias of co-occurrence can be used as an indicator of term importance. However, if term frequency is small, the degree of bias is not reliable. For example, assume term w₁ appears only once and co-occurs only with term a once (probability 1.0). At the other extreme, assume term w₂ appears 100 times and co-occurs only with term a 100 times (with probability 1.0). Intuitively, w₂ seems more reliably biased. In order to evaluate the statistical significance of biases, we use the χ² test, which is very common for evaluating biases between expected frequencies and observed frequencies. For each term, the frequency of co-occurrence with the frequent terms is regarded as a sample value; the null hypothesis is that "occurrence of frequent terms G is independent of occurrence of term w," which we expect to reject.
We denote the unconditional probability of a frequent term g ∈ G as the expected probability p_g, and the total number of co-occurrences of term w and the frequent terms G as n_w. The frequency of co-occurrence of term w and term g is written as freq(w, g). The statistical value of χ² is defined as

    χ²(w) = Σ_{g∈G} (freq(w, g) − n_w p_g)² / (n_w p_g).    (1)
If χ²(w) > χ²_α, the null hypothesis is rejected with significance level α. The term n_w p_g represents the expected frequency of co-occurrence, and (freq(w, g) − n_w p_g) represents the difference between observed and expected frequencies. Therefore, a large χ²(w) indicates that the co-occurrence of term w shows strong bias. In this paper, we use the χ²-measure as an index of bias, not for tests of hypotheses.
Table 3. Terms with high χ² value.

Rank   χ²      Term              Frequency
1      593.7   digital computer  31
2      179.3   imitation game    16
3      163.1   future            4
4      161.3   question          44
5      152.8   internal          3
6      143.5   answer            39
7      142.8   input signal      3
8      137.7   moment            2
9      130.7   play              8
10     123.0   output            15
...    ...     ...               ...
553    0.8     Mr.               2
554    0.8     sympathetic       2
555    0.7     leg               2
556    0.7     chess             2
557    0.6     Pickwick          2
558    0.6     scan              2
559    0.3     worse             2
560    0.1     eye               2

(We set the top ten frequent terms as G.)
Table 3 shows terms with high χ² values and ones with low χ² values in Turing's paper. Generally, terms with a large χ² value are relatively important in the document; terms with a small χ² value are relatively trivial. The table excludes terms whose frequency is less than two. However, we do not have to define such a threshold, because low frequency usually indicates a low χ² value (unless n_w p_g is very large, which is quite unusual).
In summary, our algorithm first extracts frequent terms as a “standard”; then
it extracts terms with high deviation from the standard as keywords.
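To make this idea concrete, the following is a minimal sketch, in Python, of the basic χ² scoring of Eq. (1). It assumes the document has already been split into sentence "baskets" of stemmed terms; all function and variable names are illustrative rather than taken from the authors' implementation.

from collections import Counter
from itertools import chain

def chi_square_scores(sentences, num_frequent=10):
    """sentences: list of lists of (stemmed) terms; returns {term: chi-square value}."""
    term_freq = Counter(chain.from_iterable(sentences))
    frequent = [t for t, _ in term_freq.most_common(num_frequent)]   # the set G
    total_g = sum(term_freq[g] for g in frequent)
    p = {g: term_freq[g] / total_g for g in frequent}                # expected probabilities p_g

    # freq(w, g): w and g co-occur at most once per sentence ("basket").
    cooc = Counter()
    for sentence in sentences:
        terms = set(sentence)
        for w in terms:
            for g in frequent:
                if g in terms and g != w:
                    cooc[(w, g)] += 1

    scores = {}
    for w in term_freq:
        n_w = sum(cooc[(w, g)] for g in frequent)     # co-occurrences of w with G
        if n_w == 0:
            continue
        scores[w] = sum((cooc[(w, g)] - n_w * p[g]) ** 2 / (n_w * p[g])
                        for g in frequent)
    return scores

Sorting the returned dictionary by value in descending order would give a ranking like that of Table 3.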
3. Algorithm Description and Improvement
In the previous section, the basic idea of our algorithm is described. This section gives the precise algorithm description and two algorithm improvements: calculation of the χ² value and clustering of terms. These improvements lead to better performance.

3.1. Calculation of χ² values
To improve the calculation of the χ² value, we focus on two aspects: variety of sentence length and robustness of the χ² value.
First, we consider the length of sentences. A document consists of sentences of various lengths. If a term appears in a long sentence, it is likely to co-occur with many terms; if a term appears in a short sentence, it is less likely to co-occur with other terms. We consider the length of each sentence and revise our definitions. We denote

  p_g as (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document),
  n_w as the total number of terms in sentences where w appears.

Again, n_w p_g represents the expected frequency of co-occurrence. However, its value becomes more precise.^a
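As a rough illustration of these revised definitions (again a sketch with illustrative names, not the authors' code), p_g and n_w can be computed directly from the sentence lists:

def expected_probability(sentences, g):
    """p_g: total terms in sentences containing g, divided by total terms in the document."""
    total_terms = sum(len(s) for s in sentences)
    return sum(len(s) for s in sentences if g in s) / total_terms

def cooccurrence_size(sentences, w):
    """n_w: total number of terms in the sentences where w appears."""
    return sum(len(s) for s in sentences if w in s)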
Second, we consider the robustness of the χ² value. A term co-occurring with a particular term g ∈ G has a high χ² value. However, such terms are sometimes adjuncts of term g and not important terms. For example, in Table 3, the terms "future" and "internal" co-occur selectively with the frequent term "state," because these terms are used in the forms "future state" and "internal state." Though the χ² values for these terms are high, "future" and "internal" themselves are not important. If "state" were not a frequent term, the χ² values of these terms would diminish rapidly.
We use the following function to measure the robustness of bias values; it subtracts the maximal term from the χ² value:

    χ′²(w) = χ²(w) − max_{g∈G} { (freq(w, g) − n_w p_g)² / (n_w p_g) }.    (2)
Using this function, we can estimate χ′²(w) as low if w co-occurs selectively with only one term; it will have a high value if w co-occurs selectively with more than one term.
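Continuing the sketch from Section 2 (illustrative names only), Eq. (2) simply drops the single largest contribution:

def chi_square_robust(w, frequent, cooc, n_w, p):
    # One contribution per frequent term g, exactly as in Eq. (1).
    contributions = [(cooc[(w, g)] - n_w * p[g]) ** 2 / (n_w * p[g])
                     for g in frequent]
    # Eq. (2): discard the single largest contribution, so a term that
    # co-occurs selectively with only one frequent term scores low.
    return sum(contributions) - max(contributions)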
3.2. Clustering of terms
Some terms co-occur with each other, and clusters of terms are obtained by combining co-occurring terms. Below we show how to calculate the χ² value more reliably by clustering terms.
A co-occurrence matrix is originally an N × N matrix, where the columns corresponding to frequent terms are extracted for calculation. We ignore the remaining columns, i.e., co-occurrence with low-frequency terms, because it is difficult to estimate a precise probability of occurrence for low-frequency terms.
To improve extracted keyword quality, it is very important to select the proper set of columns from a co-occurrence matrix.
^a p_g is the probability of a term in a document co-occurring with g. Each term can co-occur with multiple terms; therefore, the sum of p_g for all terms is not 1.0 but the average number of frequent terms in a sentence.
Table 4. Two transposed columns.

      a    b    c    d    e    f    g    h    i    j    ...
c     26   5    –    4    23   7    0    2    0    0    ...
e     18   6    23   3    –    7    1    2    1    0    ...
Table 5. Clustering of the top 49 frequent terms.
C1: game, imitation, imitation game, play, programme
C2: system, rules, result, important
C3: computer, digital, digital computer
C4: behaviour, random, law
C5: capacity, storage
C6: question, answer
· · · · · ·
C26: human
C27: state
C28: learn
The set of columns is preferably orthogonal: assuming that terms g₁ and g₂ appear together very often, the co-occurrence of terms w and g₁ might imply the co-occurrence of w and g₂. Thus, term w will have a high χ² value; this is very problematic. It is straightforward to extract an orthogonal set of columns; however, to prevent the matrix from becoming too sparse, we will cluster terms (i.e., columns).
Many studies address term clustering. Two major approaches [6] are:
Similarity-based clustering: If terms w₁ and w₂ have similar distributions of co-occurrence with other terms, w₁ and w₂ are considered to be in the same cluster.
Pairwise clustering: If terms w₁ and w₂ co-occur frequently, w₁ and w₂ are considered to be in the same cluster.
Table 4 shows an example of two (transposed) columns extracted from a co-
occurrence matrix. Similarity-based clustering centers upon boldface figures and
pairwise clustering focuses on italic figures.
By similarity-based clustering, terms with the same role, e.g., "Monday," "Tuesday," ..., or "build," "establish," and "found," are clustered [13]. In our preliminary experiment, when applied to a single document, similarity-based clustering groups paraphrases and a phrase and its component (e.g., "digital computer" and "computer"). Similarity of two distributions is measured statistically by Kullback-Leibler divergence or Jensen-Shannon divergence [2].
On the other hand, pairwise clustering yields relevant terms in the same cluster: "doctor," "nurse," and "hospital" [19]. A frequency of co-occurrence or mutual information can be used to measure the degree of relevance [1,3].
Our algorithm uses both types of clustering. First, we cluster terms by a similarity measure (using Jensen-Shannon divergence); subsequently, we apply pairwise clustering (using mutual information). Table 5 shows an example of term clustering. Proper clustering of frequent terms results in an appropriate χ² value for each term. For simplicity, we do not take the size of a cluster into account; balancing the clusters may improve the algorithm's performance.
Below, co-occurrence of a term and a cluster implies co-occurrence of the term
and any term in the cluster.
3.3. Algorithm
The algorithm follows. Thresholds are determined by preliminary experiments.
1. Preprocessing: Stem words by the Porter algorithm [14] and extract phrases based on the Apriori algorithm [5]. We extract phrases of up to 4 words that appear more than 3 times. Discard stop words included in the stop list used in the SMART system [16].
2. Selection of frequent terms: Select the top frequent terms up to 30% of the number of running terms, N_total.
3. Clustering frequent terms: Cluster a pair of terms whose Jensen-Shannon divergence is above the threshold (0.95 × log 2). The Jensen-Shannon divergence is defined as

    J(w₁, w₂) = log 2 + (1/2) Σ_{w′∈C} [ h(P(w′|w₁) + P(w′|w₂)) − h(P(w′|w₁)) − h(P(w′|w₂)) ],

where h(x) = −x log x and P(w′|w₁) = freq(w′, w₁)/freq(w₁). Cluster a pair of terms whose mutual information is above the threshold (log(2.0)). Mutual information between w₁ and w₂ is defined as

    M(w₁, w₂) = log [ P(w₁, w₂) / (P(w₁) P(w₂)) ] = log [ N_total freq(w₁, w₂) / (freq(w₁) freq(w₂)) ].

Two terms are in the same cluster if they are clustered by either of the two clustering algorithms. The obtained clusters are denoted as C. (A short code sketch of these two criteria is given at the end of this section.)
4. Calculation of expected probability: Count the number of terms co-occurring with c ∈ C, denoted as n_c, to yield the expected probability p_c = n_c/N_total.
5. Calculation of χ′² value: For each term w, count its co-occurrence frequency with c ∈ C, denoted as freq(w, c). Count the total number of terms in the sentences including w, denoted as n_w. Calculate the χ′² value following

    χ′²(w) = Σ_{c∈C} (freq(w, c) − n_w p_c)² / (n_w p_c) − max_{c∈C} { (freq(w, c) − n_w p_c)² / (n_w p_c) }.
6. Output keywords: Show a given number of terms having the largest χ′² value.
In this paper, we use both nouns and verbs because verbs or verb+noun combinations are sometimes important for illustrating the content of the document. Of course, our algorithm can also be applied to nouns only.
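The two clustering criteria of step 3 can be sketched as follows. This is a simplified illustration under the definitions above, with illustrative names: cooc is assumed to be a symmetric Counter of sentence-level co-occurrence counts, freq maps terms to their frequencies, and n_total is the number of running terms.

import math

def js_divergence(w1, w2, candidates, cooc, freq):
    """J(w1, w2) of step 3; `candidates` plays the role of the set summed over."""
    h = lambda x: -x * math.log(x) if x > 0 else 0.0
    p1 = {w: cooc[(w, w1)] / freq[w1] for w in candidates}
    p2 = {w: cooc[(w, w2)] / freq[w2] for w in candidates}
    return math.log(2) + 0.5 * sum(h(p1[w] + p2[w]) - h(p1[w]) - h(p2[w])
                                   for w in candidates)

def mutual_information(w1, w2, cooc, freq, n_total):
    """M(w1, w2) of step 3."""
    if cooc[(w1, w2)] == 0:
        return float("-inf")   # such a pair is never clustered by this criterion
    return math.log(n_total * cooc[(w1, w2)] / (freq[w1] * freq[w2]))

# Following step 3, two frequent terms g1 and g2 would go into the same cluster when
#   js_divergence(g1, g2, ...) > 0.95 * math.log(2)   or
#   mutual_information(g1, g2, ...) > math.log(2.0)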
Table 6. Improved results of terms with high χ² value.

Rank   χ²      Term                    Frequency
1      380.4   digital computer        63
2      259.7   storage capacity        11
3      202.5   imitation game          16
4      174.4   machine                 203
5      132.2   human mind              2
6      94.1    universality            6
7      93.7    logic                   10
8      82.0    property                11
9      77.1    mimic                   7
10     77.0    discrete-state machine  17
Table 6 shows the results for Turing’s paper. Important terms are extracted
regardless of their frequencies.
4. Evaluation
For information retrieval, index terms are evaluated by their retrieval performance, namely recall and precision. However, we claim that our algorithm is useful when a corpus is not available because of the cost or time required to collect documents, or in situations where document collection is infeasible.
Keywords are sometimes attached to a paper; however, they are not defined in a consistent way. Therefore, we employ author-based evaluation. Twenty authors of technical papers in artificial intelligence research participated in the experiment. For each author, we showed keywords extracted from his/her paper by tf (term frequency), tfidf^b, KeyGraph, and our algorithm. KeyGraph [11] is a keyword extraction algorithm that requires only a single document, as does our algorithm. It calculates term weights based on term co-occurrence information and was recently used to analyze a variety of data in the context of Chance Discovery [12].
All these methods use word stemming, elimination of stop words, and extraction of phrases. Using each method, we extracted, gathered, and shuffled the top 15 terms. Then, the authors were asked to check the terms that they thought were important in the paper. Precision can be calculated as the ratio of checked terms among the 15 terms derived by each method. Furthermore, the authors were asked to select five (or more) terms that they thought were indispensable for the paper. Coverage of each method was calculated by taking the ratio of the indispensable terms included in the 15 terms to all the indispensable terms. It would be desirable to have the indispensable term list beforehand; however, it is very demanding for authors to provide a keyword list without seeing a term list.
^b The corpus is 166 papers in JAIR (Journal of Artificial Intelligence Research), from Vol. 1 in 1993 to Vol. 14 in 2001. The idf is defined by log(D/df(w)) + 1, where D is the number of all documents and df(w) is the number of documents including w.
Table 7. Precision and coverage for 20 technical papers.

                   tf     KeyGraph   ours    tfidf
Precision          0.53   0.42       0.51    0.55
Coverage           0.48   0.44       0.62    0.61
Frequency index    28.6   17.3       11.5    18.1
Table 8. Results with respect to phrases.

                         tf     KeyGraph   ours    tfidf
Ratio of phrases         0.11   0.14       0.33    0.33
Precision w/o phrases    0.42   0.36       0.42    0.45
Recall w/o phrases       0.39   0.36       0.46    0.54
In our experiment, we allowed authors to add any terms in the paper to the indispensable term list (even if they were not derived by any of the methods).
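As a small illustration of this evaluation arithmetic (a sketch with hypothetical variable names, not the evaluation scripts used in the experiment):

def precision_and_coverage(top15, checked, indispensable):
    """top15: the 15 terms shown for one method; checked: terms the author marked
    as important; indispensable: the author's indispensable-term list."""
    top = set(top15)
    precision = len(top & set(checked)) / len(top15)
    coverage = len(top & set(indispensable)) / len(indispensable)
    return precision, coverage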
Results are shown in Table 7. For each method, precision was around 0.5. However, coverage using our method exceeds that of tf and KeyGraph and is comparable to that of tfidf; both tf and tfidf selected terms that appeared frequently in the document (although tfidf considers frequencies in other documents). On the other hand, our method can extract keywords even if they do not appear frequently. The frequency index in the table shows the average frequency of the top 15 terms. Terms extracted by tf appear about 28.6 times on average, while terms extracted by our method appear only 11.5 times. Therefore, our method can detect "hidden" keywords. We can use the χ² value as a priority criterion for keywords: the precision of the top 10 terms by our method is 0.52, that of the top 5 is 0.60, and that of the top 2 is as high as 0.72. Though our method detects keywords consisting of two or more words well, it is still nearly comparable to tfidf if we discard such phrases, as shown in Table 8.
Computational time of our method is shown in Figure 3. The system is implemented in C++ on a Linux machine with a Celeron 333 MHz CPU. Computational time increases approximately linearly with the number of terms; the process completes in a few seconds if the given number of terms is less than 20,000.
5. Discussion and Related Work
Co-occurrence has attracted interest for a long time in computational linguistics. For example, co-occurrence in particular syntactic contexts is used for term clustering [13]. Co-occurrence information is also useful for machine translation: for example, Tanaka et al. use co-occurrence matrices of two languages to translate an ambiguous term [19]. Co-occurrence is also used for query expansion in information retrieval [17].
Weighting a term by occurrence dates back to the 1950s in the study by Luhn [8].
Fig. 3. Number of total terms and computational time (x-axis: number of terms in a document; y-axis: time in seconds).
More elaborate measures of term occurrence have been developed [18,10] by essentially counting term frequencies. Kageura and Umino summarized five groups of weighting measures [7]:
(i) a word which appears in a document is likely to be an index term;
(ii) a word which appears frequently in a document is likely to be an index term;
(iii) a word which appears only in a limited number of documents is likely to be
an index term for these documents;
(iv) a word which appears relatively more frequently in a document than in the
whole database is likely to be an index term for that document;
(v) a word which shows a specific distributional characteristic in the database is
likely to be an index term for the database.
Our algorithm corresponds to approach (v). Nagao used the χ² value to calculate
the weight of words [9], which also corresponds to approach (v). But our method
uses a co-occurrence matrix instead of a corpus, enabling keyword extraction using
only the document itself.
From a probabilistic point of view, a method for estimating the probability of previously unseen word combinations is important [2]. Several papers have addressed this issue, but our algorithm uses co-occurrence with frequent terms, which alleviates the estimation problem.
In the context of text mining, discovering keywords or keyword relationships is an important topic [4,15]. The general purpose of knowledge discovery is to extract
implicit, previously unknown, and potentially useful information from data. Our
algorithm can be considered a text mining tool in that it extracts important terms
even if they are rare.
6. Conclusion
In this paper, we developed an algorithm to extract keywords from a single document. The main advantages of our method are its simplicity, requiring no corpus, and its high performance, comparable to that of tfidf. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.
References
[1] K. W. Church and P. Hanks. Word association norms, mutual information, and lex-
icography. Computational Linguistics, 16(1):22, 1990.
[2] I. Dagan, L. Lee, and F. Pereira. Similarity-based models of word cooccurrence prob-
abilities. Machine Learning, 34(1):43, 1999.
[3] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Com-
putational Linguistics, 19(1):61, 1993.
[4] R. Feldman, M. Fresko, Y. Kinar, Y. Lindell, O. Liphstat, M. Rajman, Y. Schler,
and O. Zamir. Text mining at the term level. In Proceedings of the Second European
Symposium on Principles of Data Mining and Knowledge Discovery, page 65, 1998.
[5] Johannes Fürnkranz. A study using n-gram features for text categorization. Technical Report OEFAI-TR-98-30, Austrian Research Institute for Artificial Intelligence, 1998.
[6] Thomas Hofmann and Jan Puzicha. Statistical models for co-occurrence data. Tech-
nical Report AIM-1625, Massachusetts Institute of Technology, 1998.
[7] K. Kageura and B. Umino. Methods of automatic term recognition. Terminology,
3(2):259, 1996.
[8] H. P. Luhn. A statistical approach to mechanized encoding and searching of literary
information. IBM Journal of Research and Development, 1(4):390, 1957.
[9] M. Nagao, M. Mizutani, and H. Ikeda. An automated method of the extraction of
important words from Japanese scientific documents. Transactions of Information
Processing Society of Japan, 17(2):110, 1976.
[10] T. Noreault, M. McGill, and M. B. Koll. A Performance Evaluation of Similarity
Measure, Document Term Weighting Schemes and Representations in a Boolean En-
vironment. Butterworths, London, 1977.
[11] Y. Ohsawa, N. E. Benson, and M. Yachida. KeyGraph: Automatic indexing by co-
occurrence graph based on building construction metaphor. In Proceedings of the
Advanced Digital Library Conference, 1998.
[12] Yukio Ohsawa. Chance discoveries for making decisions in complex real world. New
Generation Computing, to appear.
[13] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proceedings of the 31st Meeting of the Association for Computational Linguistics, pages 183–190, 1993.
[14] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130, 1980.
[15] M. Rajman and R. Besancon. Text mining knowledge extraction from unstructured
textual data. In Proceedings of the 6th Conference of International Federation of
Classification Societies, 1998.
[16] G. Salton. Automatic Text Processing. Addison-Wesley, 1988.
[17] H. Schütze and J. O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33(3):307–318, 1997.
[18] K. Sparck-Jones. A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28(5):111, 1972.
[19] K. Tanaka and H. Iwasaki. Extraction of lexical translations from non-aligned cor-
pora. In Proceedings of the 16th International Conference on Computational Linguis-
tics, page 580, 1996.
[20] A. M. Turing. Computing machinery and intelligence. Mind, 59:433, 1950.