
Abstract

This paper introduces a novel and domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. We describe the method's configuration parameters and algorithm, and present an evaluation on a benchmark corpus of technical abstracts. We also present a method for generating lists of stop words for specific corpora and domains, and evaluate its ability to improve keyword extraction on the benchmark corpus. Finally, we apply our method of automatic keyword extraction to a corpus of news articles and define metrics for characterizing the exclusivity, essentiality, and generality of extracted keywords within a corpus.
1
Automatic keyword extraction
from individual documents
Stuart Rose, Dave Engel, Nick Cramer
and Wendy Cowley
1.1 Introduction
Keywords, which we define as a sequence of one or more words, provide a
compact representation of a document’s content. Ideally, keywords represent in
condensed form the essential content of a document. Keywords are widely used
to define queries within information retrieval (IR) systems as they are easy to
define, revise, remember, and share. In comparison to mathematical signatures,
keywords are independent of any corpus and can be applied across multiple
corpora and IR systems.
Keywords have also been applied to improve the functionality of IR sys-
tems. Jones and Paynter (2002) describe Phrasier, a system that lists documents
related to a primary document’s keywords, and that supports the use of keyword
anchors as hyperlinks between documents, enabling a user to quickly access
related material. Gutwin et al. (1999) describe Keyphind, which uses keywords
from documents as the basic building block for an IR system. Keywords can also
be used to enrich the presentation of search results. Hulth (2004) describes Kee-
gle, a system that dynamically provides keyword extracts for web pages returned
from a Google search. Andrade and Valencia (1998) present a system that auto-
matically annotates protein function with keywords extracted from the scientific
literature that are associated with a given protein.
Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
©2010, John Wiley & Sons, Ltd
1.1.1 Keyword extraction methods
Despite their utility for analysis, indexing, and retrieval, most documents do
not have assigned keywords. Most existing approaches focus on the manual
assignment of keywords by professional curators who may use a fixed taxonomy,
or rely on the authors’ judgment to provide a representative list. Research has
therefore focused on methods to automatically extract keywords from documents
as an aid either to suggest keywords for a professional indexer or to generate
summary features for documents that would otherwise be inaccessible.
Early approaches to automatically extract keywords focus on evaluating
corpus-oriented statistics of individual words. Jones (1972) and Salton et al.
(1975) describe positive results of selecting for an index vocabulary the
statistically discriminating words across a corpus. Later keyword extraction
research applies these metrics to select discriminating words as keywords for
individual documents. For example, Andrade and Valencia (1998) base their
approach on comparison of word frequency distributions within a text against
distributions from a reference corpus.
While some keywords are likely to be evaluated as statistically discriminating
within the corpus, keywords that occur in many documents within the corpus are
not likely to be selected as statistically discriminating. Corpus-oriented methods
also typically operate only on single words. This further limits the measurement of
statistically discriminating words because single words are often used in multiple
and different contexts.
To avoid these drawbacks, we focus our interest on methods of keyword
extraction that operate on individual documents. Such document-oriented
methods will extract the same keywords from a document regardless of the
current state of a corpus. Document-oriented methods therefore provide context-
independent document features, enabling additional analytic methods such as
those described in Engel et al. (2009) and Whitney et al. (2009) that characterize
changes within a text stream over time. These document-oriented methods are
suited to corpora that change, such as collections of published technical abstracts
that grow over time or streams of news articles. Furthermore, by operating on a
single document, these methods inherently scale to vast collections and can be
applied in many contexts to enrich IR systems and analysis tools.
Previous work on document-oriented methods of keyword extraction has combined
natural language processing approaches that identify part-of-speech (POS) tags
with supervised machine-learning algorithms or statistical methods.
Hulth (2003) compares the effectiveness of three term selection approaches:
noun-phrase (NP) chunks, n-grams, and POS tags, with four discriminative fea-
tures of these terms as inputs for automatic keyword extraction using a supervised
machine-learning algorithm.
Mihalcea and Tarau (2004) describe a system that applies a series of syntactic
filters to identify POS tags that are used to select words to evaluate as key-
words. Co-occurrences of the selected words within a fixed-size sliding window
are accumulated within a word co-occurrence graph. A graph-based ranking
algorithm (TextRank) is applied to rank words based on their associations in
the graph, and then top ranking words are selected as keywords. Keywords that
are adjacent in the document are combined to form multi-word keywords. Mihal-
cea and Tarau (2004) report that TextRank achieves its best performance when
only nouns and adjectives are selected as potential keywords.
Matsuo and Ishizuka (2004) apply a chi-square measure to calculate how
selectively words and phrases co-occur within the same sentences as a particular
subset of frequent terms in the document text. The chi-square measure is applied
to determine the bias of word co-occurrences in the document text which is
then used to rank words and phrases as keywords of the document. Matsuo and
Ishizuka (2004) state that the degree of bias is not reliable when term frequency
is small. The authors present an evaluation on full text articles and a working
example on a 27-page document, showing that their method operates effectively
on large documents.
In the following sections, we describe Rapid Automatic Keyword Extrac-
tion (RAKE), an unsupervised, domain-independent, and language-independent
method for extracting keywords from individual documents. We provide details
of the algorithm and its configuration parameters, and present results on a bench-
mark dataset of technical abstracts, showing that RAKE is more computationally
efficient than TextRank while achieving higher precision and comparable recall
scores. We then describe a novel method for generating stoplists, which we use to
configure RAKE for specific domains and corpora. Finally, we apply RAKE to a
corpus of news articles and define metrics for evaluating the exclusivity, essential-
ity, and generality of extracted keywords, enabling a system to identify keywords
that are essential or general to documents in the absence of manual annotations.
1.2 Rapid automatic keyword extraction
In developing RAKE, our motivation has been to develop a keyword extraction
method that is extremely efficient, operates on individual documents to enable
application to dynamic collections, is easily applied to new domains, and operates
well on multiple types of documents, particularly those that do not follow specific
grammar conventions. Figure 1.1 contains the title and text for a typical abstract,
as well as its manually assigned keywords.
RAKE is based on our observation that keywords frequently contain multiple
words but rarely contain standard punctuation or stop words, such as the function
words and, the, and of, or other words with minimal lexical meaning. Reviewing
the manually assigned keywords for the abstract in Figure 1.1, there is only
one keyword that contains a stop word (of in set of natural numbers). Stop
words are typically dropped from indexes within IR systems and not included in
various text analyses as they are considered to be uninformative or meaningless.
This reasoning is based on the expectation that such words are too frequently
and broadly used to aid users in their analyses or search tasks.
Compatibility of systems of linear constraints over the set of natural numbers
Criteria of compatibility of a system of linear Diophantine equations, strict inequations,
and nonstrict inequations are considered. Upper bounds for components of a minimal set
of solutions and algorithms of construction of minimal generating sets of solutions for all
types of systems are given. These criteria and the corresponding algorithms for
constructing a minimal supporting set of solutions can be used in solving all the
considered types of systems and systems of mixed types.
Manually assigned keywords:
linear constraints, set of natural numbers, linear Diophantine equations, strict
inequations, nonstrict inequations, upper bounds, minimal generating sets
Figure 1.1 A sample abstract from the Inspec test set and its manually assigned
keywords.
Words that do carry meaning within a document are described as content bearing and are often
referred to as content words.
The input parameters for RAKE comprise a list of stop words (or stoplist), a
set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and
phrase delimiters to partition the document text into candidate keywords, which
are sequences of content words as they occur in the text. Co-occurrences of words
within these candidate keywords are meaningful and allow us to identify word co-
occurrence without the application of an arbitrarily sized sliding window. Word
associations are thus measured in a manner that automatically adapts to the style
and content of the text, enabling adaptive and fine-grained measurement of word
co-occurrences that will be used to score candidate keywords.
1.2.1 Candidate keywords
RAKE begins keyword extraction on a document by parsing its text into a set of
candidate keywords. First, the document text is split into an array of words by the
specified word delimiters. This array is then split into sequences of contiguous
words at phrase delimiters and stop word positions. Words within a sequence are
assigned the same position in the text and together are considered a candidate
keyword.
Figure 1.2 shows the candidate keywords in the order that they are parsed
from the sample technical abstract shown in Figure 1.1.
Compatibility – systems – linear constraints – set – natural numbers – Criteria –
compatibility – system – linear Diophantine equations – strict inequations – nonstrict
inequations – Upper bounds – components – minimal set – solutions – algorithms –
minimal generating sets – solutions – systems – criteria – corresponding algorithms –
constructing – minimal supporting set – solving – systems – systems
Figure 1.2 Candidate keywords parsed from the sample abstract.
The candidate keyword linear Diophantine equations begins after the stop word of and ends with a
comma. The following word strict begins the next candidate keyword strict
inequations.
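
As a rough illustration of this parsing step, the following minimal Python sketch splits a text into candidate keywords; the word splitter, phrase delimiters, and the tiny stoplist are illustrative assumptions rather than the exact configuration used in the chapter.

import re

def candidate_keywords(text, stop_words):
    # Split the text into phrases at phrase delimiters, then split each phrase
    # into contiguous runs of content words at stop word positions.
    candidates = []
    for phrase in re.split(r'[.,;:!?()"\n]', text):
        words = [w for w in re.split(r'[\s\-]+', phrase.lower()) if w]
        current = []
        for word in words:
            if word in stop_words:
                if current:
                    candidates.append(' '.join(current))
                current = []
            else:
                current.append(word)
        if current:
            candidates.append(' '.join(current))
    return candidates

# Illustrative stoplist (not the generated stoplist evaluated in Section 1.3)
stops = {'of', 'the', 'a', 'and', 'for', 'are', 'in', 'over', 'all',
         'can', 'be', 'used', 'these'}
print(candidate_keywords('Compatibility of systems of linear constraints '
                         'over the set of natural numbers.', stops))
# ['compatibility', 'systems', 'linear constraints', 'set', 'natural numbers']

The output corresponds to the first few candidates shown in Figure 1.2.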
1.2.2 Keyword scores
After every candidate keyword is identified and the graph of word co-occurrences
(shown in Figure 1.3) is complete, a score is calculated for each candidate key-
word and defined as the sum of its member word scores. We evaluated several
metrics for calculating word scores, based on the degree and frequency of word
vertices in the graph: (1) word frequency (freq(w)), (2) word degree (deg(w)),
and (3) ratio of degree to frequency (deg(w)/freq(w)).
The metric scores for each of the content words in the sample abstract are
listed in Figure 1.4. In summary, deg(w) favors words that occur often and in
longer candidate keywords; deg(minimal) scores higher than deg(systems). Words
that occur frequently regardless of the number of words with which they co-occur
are favored by freq(w); freq(systems) scores higher than freq(minimal). Words that
predominantly occur in longer candidate keywords are favored by deg(w)/freq(w);
deg(diophantine)/freq(diophantine) scores higher than deg(linear)/freq(linear).
Figure 1.3 The word co-occurrence graph for content words in the sample
abstract.
Figure 1.4 Word scores calculated from the word co-occurrence graph.
minimal generating sets (8.7), linear diophantine equations (8.5), minimal supporting set
(7.7), minimal set (4.7), linear constraints (4.5), natural numbers (4), strict inequations (4),
nonstrict inequations (4), upper bounds (4), corresponding algorithms (3.5), set (2),
algorithms (1.5), compatibility (1), systems (1), criteria (1), system (1), components (1),
constructing (1), solving (1)
Figure 1.5 Candidate keywords and their calculated scores.
The score for each candidate keyword is computed as the sum of its member
word scores. Figure 1.5 lists each candidate keyword from the sample abstract
using the metric deg(w)/freq(w) to calculate individual word scores.
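
The scoring step can also be sketched in a few lines; this sketch assumes the candidate keywords have already been parsed (for example by the sketch in Section 1.2.1) and computes freq(w), deg(w), and the deg(w)/freq(w) word scores described above.

from collections import defaultdict

def score_candidates(candidates):
    # freq(w): number of occurrences of the word w in candidate keywords.
    # deg(w): sum of the lengths of the candidates containing w, i.e. the degree
    # of w's vertex in the word co-occurrence graph (self co-occurrence included).
    freq, deg = defaultdict(int), defaultdict(int)
    for candidate in candidates:
        words = candidate.split()
        for w in words:
            freq[w] += 1
            deg[w] += len(words)
    word_score = {w: deg[w] / freq[w] for w in freq}
    # The score of a candidate keyword is the sum of its member word scores.
    return {c: sum(word_score[w] for w in c.split()) for c in set(candidates)}

# Illustrative usage on a handful of candidates from the sample abstract; scores
# differ slightly from Figure 1.5 because only a subset of candidates is used here.
cands = ['linear constraints', 'linear diophantine equations', 'minimal set',
         'minimal generating sets', 'systems', 'systems', 'systems', 'systems']
keyword_scores = score_candidates(cands)
# one-third of the number of words in the graph (see Section 1.2.4)
T = max(1, len({w for c in cands for w in c.split()}) // 3)
top = sorted(keyword_scores.items(), key=lambda kv: kv[1], reverse=True)[:T]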
1.2.3 Adjoining keywords
Because RAKE splits candidate keywords by stop words, extracted keywords do
not contain interior stop words. While RAKE has generated strong interest due to
its ability to pick out highly specific terminology, an interest was also expressed
in identifying keywords that contain interior stop words such as axis of evil. To
find these, RAKE looks for pairs of keywords that adjoin one another at least
twice in the same document and in the same order. A new candidate keyword is
then created as a combination of those keywords and their interior stop words.
The score for the new keyword is the sum of its member keyword scores.
It should be noted that relatively few of these linked keywords are extracted,
which adds to their significance. Because adjoining keywords must occur twice
in the same order within the document, their extraction is more common on texts
that are longer than short abstracts.
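
A rough sketch of this adjoining step follows, assuming the parser also records, in document order, the stop word runs between consecutive candidate keywords; this run-based representation is an illustrative assumption rather than a prescribed data structure.

from collections import Counter

def adjoining_keywords(runs, keyword_scores):
    # runs: the document as an ordered list of ('kw', candidate_keyword) and
    # ('stop', stop_word_text) items, produced while parsing candidates.
    pairs = Counter()
    for (k1, a), (k2, mid), (k3, b) in zip(runs, runs[1:], runs[2:]):
        if k1 == 'kw' and k2 == 'stop' and k3 == 'kw':
            pairs[(a, mid, b)] += 1
    combined = {}
    for (a, mid, b), count in pairs.items():
        if count >= 2:  # the pair must adjoin at least twice, in the same order
            # The new keyword keeps the interior stop words; its score is the
            # sum of the member keyword scores.
            combined[' '.join([a, mid, b])] = keyword_scores[a] + keyword_scores[b]
    return combined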
1.2.4 Extracted keywords
After candidate keywords are scored, the top T scoring candidates are selected
as keywords for the document. We compute T as one-third the number of words
in the graph, as in Mihalcea and Tarau (2004).
The sample abstract contains 28 content words, resulting in T = 9 keywords.
Table 1.1 lists the keywords extracted by RAKE compared to the sample
abstract’s manually assigned keywords. We use the statistical measures precision,
recall and F-measure to evaluate the accuracy of RAKE. Out of nine keywords
extracted, six are true positives; that is, they exactly match six of the manually
assigned keywords.
Table 1.1 Comparison of keywords extracted by RAKE to
manually assigned keywords for the sample abstract.
Extracted by RAKE Manually assigned
minimal generating sets minimal generating sets
linear diophantine equations linear Diophantine equations
minimal supporting set
minimal set
linear constraints linear constraints
natural numbers
strict inequations strict inequations
nonstrict inequations nonstrict inequations
upper bounds upper bounds
set of natural numbers
Although natural numbers is similar to the assigned keyword set of natural numbers, for the purposes of the benchmark evaluation
it is considered a miss. There are therefore three false positives in the set of
extracted keywords, resulting in a precision of 67%. Comparing the six true
positives within the set of extracted keywords to the total of seven manually
assigned keywords results in a recall of 86%. Equally weighting precision and
recall generates an F-measure of 75%.
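
For reference, the precision, recall, and F-measure used throughout the benchmark reduce to the usual definitions; the small helper below reproduces the numbers for the sample abstract.

def precision_recall_f(num_correct, num_extracted, num_assigned):
    precision = num_correct / num_extracted
    recall = num_correct / num_assigned
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Sample abstract: 6 exact matches out of 9 extracted and 7 manually assigned keywords
print(precision_recall_f(6, 9, 7))  # approximately (0.67, 0.86, 0.75)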
1.3 Benchmark evaluation
To evaluate performance we tested RAKE against a collection of technical
abstracts used in the keyword extraction experiments reported in Hulth (2003)
and Mihalcea and Tarau (2004), mainly for the purpose of allowing direct
comparison with their results.
1.3.1 Evaluating precision and recall
The collection consists of 2000 Inspec abstracts for journal papers from Computer
Science and Information Technology. The abstracts are divided into a training
set with 1000 abstracts, a validation set with 500 abstracts, and a testing set with
500 abstracts. We followed the approach described in Mihalcea and Tarau (2004),
using the testing set for evaluation because RAKE does not require a training
set. Extracted keywords for each abstract are compared against the abstract’s
associated set of manually assigned uncontrolled keywords.
Table 1.2 details RAKE’s performance using a generated stoplist, Fox’s sto-
plist (Fox 1989), and T as one-third the number of words in the graph. For
each method, which corresponds to a row in the table, the following information
is shown: the total number of extracted keywords and mean per abstract; the
number of correct extracted keywords and mean per abstract; precision; recall;
and F-measure. Results published within Hulth (2003) and Mihalcea and Tarau
(2004) are included for comparison.
Table 1.2 Results of automatic keyword extraction on 500 abstracts in the
Inspec test set using RAKE, TextRank (Mihalcea and Tarau 2004) and
supervised learning (Hulth 2003).

                                    Extracted keywords   Correct keywords
Method                              Total    Mean        Total    Mean     Precision  Recall  F-measure
RAKE (T = 0.33)
  KA stoplist (df > 10)             6052     12.1        2037     4.1      33.7       41.5    37.2
  Fox stoplist                      7893     15.8        2054     4.2      26.0       42.2    32.1
TextRank
  Undirected, co-occ. window = 2    6784     13.6        2116     4.2      31.2       43.1    36.2
  Undirected, co-occ. window = 3    6715     13.4        1897     3.8      28.2       38.6    32.6
Hulth (2003)
  Ngram with tag                    7815     15.6        1973     3.9      25.2       51.7    33.9
  NP chunks with tag                4788      9.6        1421     2.8      29.7       37.2    33.0
  Pattern with tag                  7012     14.0        1523     3.0      21.7       39.9    28.1
the, and, of, a, in, is, for, to, we, this, are, with, as, on, it, an, that, which, by, using, can,
paper, from, be, based, has, was, have, or, at, such, also, but, results, proposed, show,
new, these, used, however, our, were, when, one, not, two, study, present, its, sub, both,
then, been, they, all, presented, if, each, approach, where, may, some, more, use,
between, into, 1, under, while, over, many, through, addition, well, first, will, there,
propose, than, their, 2, most, sup, developed, particular, provides, including, other, how,
without, during, article, application, only, called, what, since, order, experimental, any
Figure 1.6 Top 100 words in the generated stoplist.
The highest values for precision, recall, and
F-measure are shown in bold. As noted, perfect precision is not possible with
any of the techniques as the manually assigned keywords do not always appear
in the abstract text. The highest precision and F-measure are achieved using
RAKE with a generated stoplist based on keyword adjacency, a subset of which
is listed in Figure 1.6. With this stoplist RAKE yields the best results in terms of
F-measure and precision, and provides comparable recall. With Fox’s stoplist,
RAKE achieves a high recall while experiencing a drop in precision.
1.3.2 Evaluating efficiency
Because of increasing interest in energy conservation in large data centers, we
also evaluated the computational cost associated with extracting keywords with
RAKE and TextRank. TextRank applies syntactic filters to a document text to
identify content words and accumulates a graph of word co-occurrences in a
window size of 2. A rank for each word in the graph is calculated through a
series of iterations until convergence below a threshold is achieved.
We set TextRank’s damping factor d=0.85 and its convergence threshold to
0.0001, as recommended in Mihalcea and Tarau (2004). We do not have access
to the syntactic filters referenced in Mihalcea and Tarau (2004), so were unable
to evaluate their computational cost.
To minimize disparity, all parsing stages in the respective extraction methods
are identical, TextRank accumulates co-occurrences in a window of size 2, and
RAKE accumulates word co-occurrences within candidate keywords. After co-
occurrences are tallied, the algorithms compute keyword scores according to their
respective methods. The benchmark was implemented in Java and executed in the
Java SE Runtime Environment (JRE) 6 on a Dell Precision T7400 workstation.
We calculated the total time for RAKE and TextRank (as an average over 100
iterations) to extract keywords from the Inspec testing set of 500 abstracts, after
the abstracts were read from files and loaded in memory. RAKE extracted key-
words from the 500 abstracts in 160 milliseconds. TextRank extracted keywords
in 1002 milliseconds, over 6 times the time of RAKE.
Referring to Figure 1.7, we can see that as the number of content words
for a document increases, the performance advantage of RAKE over TextRank
increases. This is due to RAKE’s ability to score keywords in a single pass
whereas TextRank requires repeated iterations to achieve convergence on
word ranks.
Based on this benchmark evaluation, it is clear that RAKE effectively extracts
keywords and outperforms the current state of the art in terms of precision, effi-
ciency, and simplicity. As RAKE can be put to use in many different systems and
applications, in the next section we discuss a method for stoplist generation that
may be used to configure RAKE on particular corpora, domains, and languages.
1.4 Stoplist generation
Stoplists are widely used in IR and text analysis applications. However, there is
remarkably little information describing methods for their creation. Fox (1989)
presents an analysis of stoplists, noting discrepancies between stated conven-
tions and actual instances and implementations of stoplists. The lack of tech-
nical rigor associated with the creation of stoplists presents a challenge when
comparing text analysis methods. In practice, stoplists are often based on com-
mon function words and hand-tuned for particular applications, domains, or
specific languages.
We evaluated the use of term frequency as a metric for automatically selecting
words for a stoplist. Table 1.3 lists the top 50 words by term frequency in the
training set of abstracts in the benchmark dataset. Additional metrics shown for
each word are document frequency, adjacency frequency, and keyword frequency.
Adjacency frequency reflects the number of times the word occurred adjacent to
an abstract's keywords.
Figure 1.7 Comparison of TextRank and RAKE extraction times (in milliseconds) on
individual documents, as a function of the number of vertices in the word
co-occurrence graph.
Keyword frequency reflects the number of times the word
occurred within an abstract’s keywords.
Looking at the top 50 frequent words, in addition to the typical function
words, we can see that system, control, and method are highly frequent within
technical abstracts and highly frequent within the abstracts’ keywords. Selecting
solely by term frequency will therefore cause content-bearing words to be added
to the stoplist, particularly if the corpus of documents is focused on a particular
domain or topic. In those circumstances, selecting stop words by term frequency
presents a risk of removing important content-bearing words from analysis.
We therefore present the following method for automatically generating a
stoplist from a set of documents for which keywords are defined. The algorithm
is based on the intuition that words adjacent to, and not within, keywords are
less likely to be meaningful and therefore are good choices for stop words.
To generate our stoplist we identified for each abstract in the Inspec training
set the words occurring adjacent to words in the abstract’s uncontrolled key-
word list. The frequency of each word occurring adjacent to a keyword was
accumulated across the abstracts. Words that occurred more frequently within
keywords than adjacent to them were excluded from the stoplist.
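
A sketch of this stoplist generation step is shown below, assuming each training document is available as raw text together with its list of assigned keywords; the simple tokenization and the document frequency cut-off (df > 10, as used for the KA stoplist in Table 1.2) are simplifications of the chapter's experimental setup.

import re
from collections import defaultdict

def generate_ka_stoplist(documents, min_df=10):
    # documents: iterable of (text, assigned_keywords) pairs.
    adjacency_freq = defaultdict(int)  # word occurs next to an assigned keyword
    keyword_freq = defaultdict(int)    # word occurs inside an assigned keyword
    doc_freq = defaultdict(int)
    for text, keywords in documents:
        words = re.findall(r'[a-z0-9]+', text.lower())
        for w in set(words):
            doc_freq[w] += 1
        for kw in keywords:
            kw_words = kw.lower().split()
            n = len(kw_words)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == kw_words:        # keyword occurrence
                    for w in kw_words:
                        keyword_freq[w] += 1
                    if i > 0:                         # word just before it
                        adjacency_freq[words[i - 1]] += 1
                    if i + n < len(words):            # word just after it
                        adjacency_freq[words[i + n]] += 1
    # Keep frequent words that appear adjacent to keywords, excluding words that
    # occur more often within keywords than adjacent to them.
    return {w for w, adj in adjacency_freq.items()
            if doc_freq[w] > min_df and keyword_freq[w] <= adj}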
Table 1.3 The 50 most frequent words in the Inspec training set listed in
descending order by term frequency.
Word            Term frequency   Document frequency   Adjacency frequency   Keyword frequency
the 8611 978 3492 3
of 5546 939 1546 68
and 3644 911 2104 23
a 3599 893 1451 2
to 3000 879 792 10
in 2656 837 1402 7
is 1974 757 1175 0
for 1912 767 951 9
that 1129 590 330 0
with 1065 577 535 3
are 1049 576 555 1
this 964 581 645 0
on 919 550 340 8
an 856 501 332 0
we 822 388 731 0
by 773 475 283 0
as 743 435 344 0
be 595 395 170 0
it 560 369 339 13
system 507 255 86 202
can 452 319 250 0
based 451 293 168 15
from 447 309 187 0
using 428 282 260 0
control 409 166 12 237
which 402 280 285 0
paper 398 339 196 1
systems 384 194 44 191
method 347 188 78 85
data 347 159 39 131
time 345 201 24 95
model 343 157 37 122
information 322 153 18 151
or 315 218 146 0
s 314 196 27 0
have 301 219 149 0
has 297 225 166 0
at 296 216 141 0
new 294 197 93 4
two 287 205 83 5
algorithm 267 123 36 96
results 262 221 129 14
used 262 204 92 0
was 254 125 161 0
these 252 200 93 0
also 251 219 139 0
such 249 198 140 0
problem 234 137 36 55
design 225 110 38 68
To evaluate this method of generating stoplists, we created six stoplists, three
of which select words for the stoplist by term frequency (TF), and three which
select words by term frequency but also exclude words from the stoplist whose
keyword frequency was greater than their keyword adjacency frequency. We
refer to this latter set of stoplists as keyword adjacency (KA) stoplists since they
primarily include words that are adjacent to and not within keywords.
Table 1.4 Comparison of RAKE performance using stoplists based on term
frequency (TF) and keyword adjacency (KA).

                           Stoplist   Extracted keywords   Correct keywords
Method                     size       Total    Mean        Total    Mean     Precision  Recall  F-measure
RAKE (T = 0.33)
  TF stoplist (df > 10)    1347       3670      7.3         606     1.2      16.5       12.3    14.1
  TF stoplist (df > 25)     527       5563     11.1        1032     2.1      18.6       21.0    19.7
  TF stoplist (df > 50)     205       7249     14.5        1520     3.0      21.0       30.9    25.0
RAKE (T = 0.33)
  KA stoplist (df > 10)     763       6052     12.1        2037     4.1      33.7       41.5    37.2
  KA stoplist (df > 25)     325       7079     14.2        2103     4.3      29.7       42.8    35.1
  KA stoplist (df > 50)     147       8013     16.0        2117     4.3      26.4       43.1    32.8
Each of the stoplists was set as the input stoplist for RAKE, which was
then run on the testing set of the Inspec corpus of technical abstracts. Table 1.4
lists the precision, recall, and F-measure for the keywords extracted by each
of these runs. The KA stoplists generated by our method outperformed the
TF stoplists generated by term frequency. A notable difference between results
achieved using the two types of stoplists is evident in Table 1.4: the F-measure
improves as more words are added to a KA stoplist, whereas when more words are
added to a TF stoplist the F-measure degrades. Furthermore, the best TF stoplist
underperforms the worst KA stoplist. This verifies that our algorithm for gener-
ating stoplists is adding the right stop words and excluding content words from
the stoplist.
Because the generated KA stoplists leverage manually assigned keywords, we
envision that an ideal application would be within existing digital libraries or IR
systems and collections where defined keywords exist or are easily identified for
a subset of the documents. Stoplists only need to be generated once for particular
domains, enabling RAKE to be applied to new and future articles, facilitating
the annotation and indexing of new documents.
1.5 Evaluation on news articles
While we have shown that a simple set of configuration parameters enables
RAKE to efficiently extract keywords from individual documents, it is worth
investigating how well extracted keywords represent the essential content within
a corpus of documents for which keywords have not been manually assigned.
The following section presents results on application of RAKE to the Multi-
Perspective Question Answering (MPQA) Corpus (CERATOPS 2009).
1.5.1 The MPQA Corpus
The MPQA Corpus consists of 535 news articles provided by the Center for the
Extraction and Summarization of Events and Opinions in Text (CERATOPS).
Articles in the MPQA Corpus are from 187 different foreign and US news sources
and date from June 2001 to May 2002.
1.5.2 Extracting keywords from news articles
We extracted keywords from title and text fields of documents in the MPQA
Corpus and set a minimum document threshold of two because we are interested
in keywords that are associated with multiple documents.
Candidate keyword scores were based on word scores as deg(w)/freq(w)
and as deg(w). Calculating word scores as deg(w)/freq(w), RAKE extracted 517
keywords referenced by an average of 4.9 documents. Calculating word scores
as deg(w), RAKE extracted 711 keywords referenced by an average of 8.1
documents.
This difference in the average number of referencing documents is the
result of longer keywords having lower frequency across documents. The metric
deg(w)/freq(w) favors longer keywords and therefore results in extracted key-
words that occur in fewer documents in the MPQA Corpus.
In many cases a subject is occasionally presented in its long form and more
frequently referenced in its shorter form. For example, referring to Table 1.5,
kyoto protocol on climate change and 1997 kyoto protocol occur less frequently
than the shorter kyoto protocol. Because our interest in the analysis of news
articles is to connect articles that reference related content, we set RAKE to
score words by deg(w) in order to favor shorter keywords that occur across more
documents.
Because most documents are unique within any given corpus, we expect to
find variability in what documents are essentially about as well as how each
document represents specific subjects. While some documents may be primarily
about the kyoto protocol, greenhouse gas emissions, and climate change, other
documents may only make references to those subjects. Documents in the former
set will likely have kyoto protocol, greenhouse gas emissions, and climate change
extracted as keywords whereas documents in the latter set will not.
In many applications, users have a desire to capture all references to extracted
keywords. For the purposes of evaluating extracted keywords, we accumulate
Table 1.5 Keywords extracted with word scores by deg(w) and deg(w)/freq(w).

                                                     Scored by deg(w)   Scored by deg(w)/freq(w)
Keyword                                              edf(w)   rdf(w)    edf(w)   rdf(w)
kyoto protocol legally obliged developed countries     2        2         2        2
eu leader urge russia to ratify kyoto protocol         2        2         2        2
kyoto protocol on climate change                       2        2         2        2
ratify kyoto protocol                                  2        2         2        2
kyoto protocol requires                                2        2         2        2
1997 kyoto protocol                                    2        4         4        4
kyoto protocol                                        31       44         7       44
kyoto                                                 10       12
kyoto accord                                           3        3
kyoto pact                                             2        3
sign kyoto protocol                                    2        2
ratification of the kyoto protocol                     2        2
ratify the kyoto protocol                              2        2
kyoto agreement                                        2        2
counts on how often each extracted keyword is referenced by documents in the
corpus. The referenced document frequency of a keyword, rdf(k), is the number of
documents in which the keyword occurred as a candidate keyword. The extracted
document frequency of a keyword, edf(k), is the number of documents from which
the keyword was extracted.
A keyword that is extracted from all of the documents in which it is refer-
enced can be characterized as exclusive or essential, whereas a keyword that is
referenced in many documents but extracted from a few may be characterized as
general. Comparing the relationship of edf(k) and rdf(k) allows us to characterize
the exclusivity of a particular keyword. We therefore define keyword exclusivity
exc(k) as shown in Equation (1.1):
exc(k) = edf(k) / rdf(k).    (1.1)
Of the 711 extracted keywords, 395 have an exclusivity score of 1, indicating
that they were extracted from every document in which they were referenced.
Within that set of 395 exclusive keywords, some occur in more documents than
others and can therefore be considered more essential to the corpus of documents.
In order to measure how essential a keyword is, we define the essentiality of a
keyword, ess(k), as shown in Equation (1.2):
ess(k) = exc(k) × edf(k).    (1.2)
Figure 1.8 lists the top 50 essential keywords extracted from the MPQA cor-
pus, listed in descending order by their ess(k) scores. According to CERATOPS,
the MPQA corpus comprises 10 primary topics, listed in Table 1.6, which are
well represented by the 50 most essential keywords as extracted and ranked by
RAKE.
In addition to keywords that are essential to documents, we can also char-
acterize keywords by how general they are to the corpus.
united states (32), human rights (24), kyoto protocol (22), international space station (18),
mugabe (16), space station (14), human rights report (12), greenhouse gas emissions
(12), chavez (11), taiwan issue (11), president chavez (10), human rights violations (10),
president bush (10), palestinian people (10), prisoners of war (9), president hugo chavez
(9), kyoto (8), taiwan (8), israeli government (8), hugo chavez (8), climate change (8),
space (8), axis of evil (7), president fernando henrique cardoso (7), palestinian (7),
palestinian territories (6), taiwan strait (6), russian news agency interfax (6), prisoners (6),
taiwan relations act (6), president robert mugabe (6), presidential election (6), geneva
convention (5), palestinian authority (5), venezuelan president hugo chavez (5), chinese
president jiang zemin (5), opposition leader morgan tsvangirai (5), french news agency
afp (5), bush (5), north korea (5), camp x-ray (5), rights (5), election (5), mainland china
(5), al qaeda (5), president (4), south africa (4), global warming (4), bush administration
(4), mdc leader (4)
Figure 1.8 Top 50 essential keywords from the MPQA Corpus, with correspond-
ing ess(k) score in parentheses.
Table 1.6 MPQA Corpus topics and definitions.
Topic Description
argentina Economic collapse in Argentina
axisofevil Reaction to President Bush’s 2002 State of the Union Address
guantanamo US holding prisoners in Guantanamo Bay
humanrights Reaction to US State Department report on human rights
kyoto Ratification of Kyoto Protocol
mugabe 2002 Presidential election in Zimbabwe
settlements Israeli settlements in Gaza and West Bank
spacestation Space missions of various countries
taiwan Relations between Taiwan and China
venezuela Presidential coup in Venezuela
government (147), countries (141), people (125), world (105), report (91), war (85), united
states (79), china (71), president (69), iran (60), bush (56), japan (50), law (44), peace
(44), policy (43), officials (43), israel (41), zimbabwe (39), taliban (36), prisoners (35),
opposition (35), plan (35), president george (34), axis (34), administration (33), detainees
(32), treatment (32), states (30), european union (30), palestinians (30), election (29),
rights (28), international community (27), military (27), argentina (27), america (27),
guantanamo bay (26), official (26), weapons (24), source (24), eu (23), attacks (23),
united nations (22), middle east (22), bush administration (22), human rights (21), base
(20), minister (20), party (19), north korea (18)
Figure 1.9 Top 50 general keywords from the MPQA Corpus, with corresponding
gen(k) score in parentheses.
In other words, how often was a keyword referenced by documents from which it was not extracted?
In this case we define generality of a keyword, gen(k), as shown in Equation
(1.3):
gen(k) = rdf(k) × (1.0 − exc(k)).    (1.3)
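
Given the extracted and referenced document frequencies of each keyword, Equations (1.1)-(1.3) translate directly into code; a minimal sketch using the kyoto protocol counts from Table 1.5:

def keyword_metrics(edf, rdf):
    # edf, rdf: dicts mapping a keyword to its extracted / referenced document frequency.
    metrics = {}
    for k, referenced in rdf.items():
        extracted = edf.get(k, 0)
        exc = extracted / referenced        # Equation (1.1)
        ess = exc * extracted               # Equation (1.2)
        gen = referenced * (1.0 - exc)      # Equation (1.3)
        metrics[k] = (exc, ess, gen)
    return metrics

print(keyword_metrics({'kyoto protocol': 31}, {'kyoto protocol': 44}))
# exclusivity ~ 0.70, essentiality ~ 21.8 (listed as 22 in Figure 1.8), generality ~ 13.0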
Figure 1.9 lists the top 50 general keywords extracted from the MPQA corpus,
listed in descending order by their gen(k) scores. It should be noted that general
keywords and essential keywords are not mutually exclusive. Within the top 50
for both metrics, there are several shared keywords: united states, president,
bush, prisoners, election, rights, bush administration, human rights, and north
korea. Keywords that are both highly essential and highly general are essential
to a set of documents within the corpus but also referenced by a significantly
greater number of documents within the corpus than other keywords.
1.6 Summary
We have shown that our automatic keyword extraction technology, RAKE,
achieves higher precision and similar recall in comparison to existing techniques.
In contrast to methods that depend on natural language processing techniques
to achieve their results, RAKE takes a simple set of input parameters and
automatically extracts keywords in a single pass, making it suitable for a wide
range of documents and collections.
Finally, RAKE’s simplicity and efficiency enable its use in many applications
where keywords can be leveraged. Based on the variety and volume of existing
collections and the rate at which documents are created and collected, RAKE
provides advantages and frees computing resources for other analytic methods.
1.7 Acknowledgements
This work was supported by the National Visualization and Analytics Center
(NVAC), which is sponsored by the US Department of Homeland Security
Program and located at the Pacific Northwest National Laboratory (PNNL), and
by Laboratory Directed Research and Development at PNNL. PNNL is managed
for the US Department of Energy by Battelle Memorial Institute under Contract
DE-AC05-76RL01830.
We also thank Anette Hulth, for making available the dataset used in her
experiments.
References
Andrade M and Valencia A 1998 Automatic extraction of keywords from scientific
text: application to the knowledge domain of protein families. Bioinformatics 14(7),
600–607.
CERATOPS 2009 MPQA Corpus http://www.cs.pitt.edu/mpqa/ceratops/corpora.html.
Engel D, Whitney P, Calapristi A and Brockman F 2009 Mining for emerging technolo-
gies within text streams and documents. Proceedings of the Ninth SIAM International
Conference on Data Mining. Society for Industrial and Applied Mathematics.
Fox C 1989 A stop list for general text. ACM SIGIR Forum, vol. 24, pp. 19–21. ACM,
New York, USA.
Gutwin C, Paynter G, Witten I, Nevill-Manning C and Frank E 1999 Improving browsing
in digital libraries with keyphrase indexes. Decision Support Systems 27(1–2), 81–104.
Hulth A 2003 Improved automatic keyword extraction given more linguistic knowledge.
Proceedings of the 2003 Conference on Empirical Methods in Natural Language
Processing, vol. 10, pp. 216–223. Association for Computational Linguistics, Morristown,
NJ, USA.
Hulth A 2004 Combining machine learning and natural language processing for automatic
keyword extraction. Stockholm University, Faculty of Social Sciences, Department of
Computer and Systems Sciences (together with KTH).
Jones K 1972 A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation 28(1), 11–21.
Jones S and Paynter G 2002 Automatic extraction of document keyphrases for use in
digital libraries: evaluation and applications. Journal of the American Society for Infor-
mation Science and Technology.
Matsuo Y and Ishizuka M 2004 Keyword extraction from a single document using word
co-occurrence statistical information. International Journal on Artificial Intelligence
Tools 13(1), 157–169.
Mihalcea R and Tarau P 2004 TextRank: Bringing order into texts. In Proceedings of
EMNLP 2004 (ed. Lin D and Wu D), pp. 404–411. Association for Computational
Linguistics, Barcelona, Spain.
Salton G, Wong A and Yang C 1975 A vector space model for automatic indexing.
Communications of the ACM 18(11), 613–620.
Whitney P, Engel D and Cramer N 2009 Mining for surprise events within text streams.
Proceedings of the Ninth SIAM International Conference on Data Mining, pp. 617–627.
Society for Industrial and Applied Mathematics.
... We optimize the keyword predictor with maximum likelihood training and a next token prediction loss. Following Yao et al. (2019), we provide the labels for K i by extracting keywords from a ground truth training sentence s i using the RAKE algorithm (Rose et al., 2010) to train our keyword predictor. Note that our model allows generation of multiple keywords and thus provides the flexibility to choose a subset of them as the control signal to fit in the generation. ...
... Following our hypothesis, we first extract keywords K i from s i using RAKE (Rose et al., 2010) and then match K i with all knowledge triples in G. Transforming the retrieved triples into knowledge sentences gives us our set ofR i . We then take the sentence s i and s i−1 , concatenate them, and encode them using the Universal Sentence Encoder (USE) (Cer et al., 2018), a widely-used toolkit for semantic similarity, U s = U ([s i−1 , s i ]), where we denote the encoder of USE as U . ...
Preprint
Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to ground-truth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).
... The objective of the present paper is to provide the state-of-the-art techniques in the field of automatic term extraction, and to compare three well-known Automatic Term Extraction tools, viz, RAKE [17], TerMine [12], TermRaider [18] vis-àvis RENT [19], one of the latest term extraction scheme proposed in recent time. The paper further focuses on identifying the challenges and future directions to enhance the study. ...
... Rapid Automatic Keyword Extraction (RAKE) [17] is a simple tool for automatic term extraction. Figure 1 shows the flow of algorithm for RAKE, explained in the following paragraphs. ...
Preprint
Agriculture is a key component in any country's development. Domain-specific knowledge resources serve to gain insight into the domain. Existing knowledge resources such as AGROVOC and NAL Thesaurus are developed and maintained by the domain experts. Population of terms into these knowledge resources can be automated by using automatic term extraction tools for processing unstructured agricultural text. Automatic term extraction is also a key component in many semantic web applications, such as ontology creation, recommendation systems, sentiment classification, query expansion among others. The primary goal of an automatic term extraction system is to maximize the number of valid terms and minimize the number of invalid terms extracted from the input set of documents. Despite its importance in various applications, the availability of online tools for the said purpose is rather limited. Moreover, the performance of the most popular ones among them varies significantly. As a consequence, selection of the right term extraction tool is perceived as a serious problem for different knowledge-based applications. This paper presents an analysis of three commonly used term extraction tools, viz. RAKE, TerMine, TermRaider and compares their performance in terms of precision and recall, vis-a-vis RENT, a more recent term extractor developed by these authors for agriculture domain.
... Keyword extraction algorithms that represent the keywords in terms of a subset of terms from the original text are available in the literature [131,158]. However, query log analysis research revealed, that over 71% of (user generated) search queries contain named entities [64] and named entities exhibited good performance in related work [109]. ...
Thesis
A plethora of resources made available via retrieval systems in digital libraries remains untapped in the so called long tail of the Web. These long-tail websites get considerably less visits than major Web hubs. Zero-effort queries ease the discovery of long-tail resources by proactively retrieving and presenting information based on a user’s context. However, zero-effort queries over existing digital library structures are challenging, since the underlying retrieval system is only accessible via an API. The information need must be expressed by a query, instead of optimizing the ranking between context and resources in the retrieval system directly. We address three research questions that arise from replacing the user information seeking process by zero-effort queries. Our first question addresses the transformation of a user query to an automatic query, derived from the context. We present means to 1) identify the relevant context on different levels of granularity, 2) derive an information need from the context via keyword extraction and personalization and 3) express this information need in a query scheme that avoids over- or under-specified queries. We address the cold start problem with an approach to bootstrap user profiles from social media, even for passive users. With the second question, we address the presentation of resources in zero-effort query scenarios, presenting guidelines for presentation interfaces in the browser and a visualization of the triadic relationship between context, query and results. QueryCrumbs, a compact query history visualization supports recalling information found in the past and exploratory search by visualizing qualitative and quantitative query similarity. Our last question addresses the gap between (simple) keyword queries and the representation of resources by rich and complex meta-data. We investigate and extend feature representation learning techniques centered around the skip-gram model with negative sampling. Finally, we present an approach to learn representations from network and text jointly that can cope with the partial absence of one modality. Experimental results show close to human performance of our zero-effort query and user profile generation approach and visualizations to be helpful in terms of transparency, efficiency and support for exploratory search. These results indicate that the proposed zero-effort query approach indeed eases the discovery of long-tail resources and the accompanying visualizations further facilitate this process. The joint representation model provides a first step to bridge the gap between query and resource representation and we plan to follow and investigate this route further in the future.
... Cue-phrases for Training and Automatic Evaluation: For training all models, we need cue phrases, which are, in principle, to be entered by a user. However, to scale model training, we automatically extracted cue phrases from the target sentences in the training set using the previously proposed RAKE algorithm (Rose et al., 2010). It is important to note that cue phrases can represent a variety of information, and many other methods can be used to extract them for training purposes. ...
Preprint
Automatically generating stories is a challenging problem that requires producing causally related and logical sequences of events about a topic. Previous approaches in this domain have focused largely on one-shot generation, where a language model outputs a complete story based on limited initial input from a user. Here, we instead focus on the task of interactive story generation, where the user provides the model mid-level sentence abstractions in the form of cue phrases during the generation process. This provides an interface for human users to guide the story generation. We present two content-inducing approaches to effectively incorporate this additional information. Experimental results from both automatic and human evaluations show that these methods produce more topically coherent and personalized stories compared to baseline methods.
... Our motivation is to use twostage generation to improve performance of GPT2 , so we do not design a new middle form. Specifically, we use the RAKE algorithm (Rose et al., 2010) 1 to extract keywords of story. According to (Yao et al., 2019) and the average lengths of stories in our corpus, we extract 10 keywords for each story. ...
Preprint
Story generation is a challenging task, which demands to maintain consistency of the plots and characters throughout the story. Previous works have shown that GPT2, a large-scale language model, has achieved good performance on story generation. However, we observe that several serious issues still exist in the stories generated by GPT2 which can be categorized into two folds: consistency and coherency. In terms of consistency, on one hand, GPT2 cannot guarantee the consistency of the plots explicitly. On the other hand, the generated stories usually contain coreference errors. In terms of coherency, GPT2 does not take account of the discourse relations between sentences of stories directly. To enhance the consistency and coherency of the generated stories, we propose a two-stage generation framework, where the first stage is to organize the story outline which depicts the story plots and events, and the second stage is to expand the outline into a complete story. Therefore the plots consistency can be controlled and guaranteed explicitly. In addition, coreference supervision signals are incorporated to reduce coreference errors and improve the coreference consistency. Moreover, we design an auxiliary task of discourse relation modeling to improve the coherency of the generated stories. Experimental results on a story dataset show that our model outperforms the baseline approaches in terms of both automatic metrics and human evaluation.
... • Off-the-shelf systems extract keywords for each sentence using the three off-the-shelf systems: YAKE (Campos et al., 2020) using statistical features (e.g., TF, IDF), RAKE (Rose et al., 2010) using graph-based features (e.g., word degree), and PositionRank (Florescu and Caragea, 2017) using position-based PageRank. Then we choose duplicate keywords by majority voting. ...
Preprint
Despite the recent success of contextualized language models on various NLP tasks, language model itself cannot capture textual coherence of a long, multi-sentence document (e.g., a paragraph). Humans often make structural decisions on what and how to say about before making utterances. Guiding surface realization with such high-level decisions and structuring text in a coherent way is essentially called a planning process. Where can the model learn such high-level coherence? A paragraph itself contains various forms of inductive coherence signals called self-supervision in this work, such as sentence orders, topical keywords, rhetorical structures, and so on. Motivated by that, this work proposes a new paragraph completion task PARCOM; predicting masked sentences in a paragraph. However, the task suffers from predicting and selecting appropriate topical content with respect to the given context. To address that, we propose a self-supervised text planner SSPlanner that predicts what to say first (content prediction), then guides the pretrained language model (surface realization) using the predicted content. SSPlanner outperforms the baseline generation models on the paragraph completion task in both automatic and human evaluation. We also find that a combination of noun and verb types of keywords is the most effective for content selection. As more number of content keywords are provided, overall generation quality also increases.
... Many prior works use off the shelf tools to first label stories with plan outlines, thus using external supervision for learning plot structures. For example, Yao et al. (2019) use the RAKE heuristic (Rose et al., 2010) to first identify the most important keyword in each sentence, and then use this to train a model in a supervised fashion. This approach leads to improved coherency and control, but creates a reliance on such heuristics and does not jointly learn anchor words along with the generator. ...
Preprint
Past work on story generation has demonstrated the usefulness of conditioning on a generation plan to generate coherent stories. However, these approaches have used heuristics or off-the-shelf models to first tag training stories with the desired type of plan, and then train generation models in a supervised fashion. In this paper, we propose a deep latent variable model that first samples a sequence of anchor words, one per sentence in the story, as part of its generative process. During training, our model treats the sequence of anchor words as a latent variable and attempts to induce anchoring sequences that help guide generation in an unsupervised fashion. We conduct experiments with several types of sentence decoder distributions: left-to-right and non-monotonic, with different degrees of restriction. Further, since we use amortized variational inference to train our model, we introduce two corresponding types of inference network for predicting the posterior on anchor words. We conduct human evaluations which demonstrate that the stories produced by our model are rated better in comparison with baselines which do not consider story plans, and are similar or better in quality relative to baselines which use external supervision for plans. Additionally, the proposed model gets favorable scores when evaluated on perplexity, diversity, and control of story via discrete plan.
... To evaluate the performance, we compare some previously introduced system with our proposed feature set and their combinations. Here Statistical feature-based (TF-IDF [46], RAKE [45] and KEA [57]), Graph-based (Text-Rank [34], WA-Rank [56], TSAKE [44], SG-Rank [10] and SIF-Rank [50]), Topic-based (LDA [64]), word embedding based methods (W2Vec [32] and Embed Rank [48]) and others (PP-Score [1]) are used as our baselines. It is evident from the Table that the proposed "Skip Gram + Word Sense + Morpheme + POS" model outperforms for keyphrase extraction task among all mentioned datasets. ...
Article
Full-text available
The internet changed the way that people communicate, and this has led to a vast amount of Text that is available in electronic format. It includes things like e-mail, technical and scientific reports, tweets, physician notes and military field reports. Providing key-phrases for these extensive text collections thus allows users to grab the essence of the lengthy contents quickly and helps to locate information with high efficiency. While designing a Keyword Extraction and Indexing system, it is essential to pick unique properties, called features. In this article, we proposed different unsupervised keyword extraction approaches, which is independent of the structure, size and domain of the documents. The proposed method relies on the novel and cognitive inspired set of standard, phrase, word embedding and external knowledge source features. The individual and selected feature results are reported through experimentation on four different datasets viz. SemEval, KDD, Inspec, and DUC. The selected (feature selection) and word embedding based features are the best features set to be used for keywords extraction and indexing among all mentioned datasets. That is the proposed distributed word vector with additional knowledge improves the results significantly over the use of individual features, combined features after feature selection and state-of-the-art. After successfully achieving the objective of developing various keyphrase extraction methods we also experimented it for document classification task.
... We further remove all stop-words from the lists. For the keyword extraction, we use an open source implementation [29] (version 1.0.4) of the RAKE algorithm [30]. Based on an input string, RAKE produces a set ...
Figure 1: Overview of the process used to evaluate our approach.
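For readers unfamiliar with how such off-the-shelf RAKE implementations are typically invoked, the following minimal sketch uses the rake_nltk package purely as an example; it is an assumption that this package resembles the implementation cited as [29], which may differ in interface and version.

# Minimal sketch: extract ranked keywords from an input string with an
# open-source RAKE implementation. rake_nltk is used only as an example;
# the implementation cited as [29] above may differ.
# Requires: pip install rake-nltk, plus NLTK stop word and tokenizer data.
from rake_nltk import Rake

text = ("Compatibility of systems of linear constraints over the set of "
        "natural numbers is considered.")

rake = Rake()                                 # default English stop list
rake.extract_keywords_from_text(text)         # candidate extraction and scoring
for score, phrase in rake.get_ranked_phrases_with_scores():
    print(f"{score:.1f}\t{phrase}")           # highest-scoring phrases first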
Conference Paper
A software developer works on many tasks per day, frequently switching between these tasks back and forth. This constant churn of tasks makes it difficult for a developer to know the specifics of when they worked on what task, complicating task resumption, planning, retrospection, and reporting activities. In a first step towards an automated aid to this issue, we introduce a new approach to help identify the topic of work during an information seeking task — one of the most common types of tasks that software developers face — that is based on capturing the contents of the developer’s active window at regular intervals and creating a vector representation of key information the developer viewed. To evaluate our approach, we created a data set with multiple developers working on the same set of six information seeking tasks that we also make available for other researchers to investigate similar approaches. Our analysis shows that our approach enables: 1) segments of a developer’s work to be automatically associated with a task from a known set of tasks with average accuracy of 70.6%, and 2) a word cloud describing a segment of work that a developer can use to recognize a task with average accuracy of 67.9%.
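As a rough, hypothetical illustration of the vector-matching idea described above (not the authors' pipeline; the task descriptions, window snippet, and use of scikit-learn are all assumptions), one could associate a captured window segment with the closest known task as follows:

# Hypothetical sketch: match a captured text segment to one of a known set
# of tasks via TF-IDF vectors and cosine similarity. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tasks = {                                     # invented task descriptions
    "fix-login-bug": "login session cookie authentication error stack trace",
    "upgrade-build": "gradle build dependency version upgrade plugin",
}
window_text = "NullPointerException in SessionManager during cookie refresh"

matrix = TfidfVectorizer().fit_transform(list(tasks.values()) + [window_text])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best_score, best_task = max(zip(scores, tasks))
print(f"most likely task: {best_task} (similarity {best_score:.2f})")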
Article
Full-text available
Automatic keyword extraction is the task of automatically selecting a small set of terms describing the content of a single document. That a keyword is extracted means that it is present verbatim in the document to which it is assigned. This dissertation discusses the development of an algorithm for automatic keyword extraction, and presents a number of experiments, in which the performance of the algorithm is incrementally improved. The approach taken is that of supervised machine learning, that is, prediction models are constructed from documents with known keywords. Before any learning can take place, the data must be pre-processed and represented. In the presented research, two problems concerning the representation for keyword extraction are tackled. Since a keyword may consist of more than one token, the first problem concerns where a keyword begins and ends in a running text, that is, how a candidate term is defined. In this dissertation, three term selection approaches are defined and evaluated. The first approach extracts all uni-, bi-, and trigrams, the second approach extracts all noun phrase chunks, while the third approach extracts all terms matching any of a number of empirically defined part-of-speech patterns. Since the majority of the extracted candidate terms are not keywords, the second problem concerns how these terms can be limited, to only keep those that are appropriate as keywords. In the presented research, four features for filtering the candidate terms are investigated. These are term frequency, inverse document frequency, relative position of the first occurrence, and the part-of-speech tag or tags assigned to the candidate term. The research presented in this dissertation is linguistically oriented in the sense that the output from natural language processing tools is a considerable factor both for the pre-processing of the data, as well as for the performance of the prediction models. Of the three term selection approaches, the best individual performance – as measured by keywords previously assigned by professional indexers – is achieved by the noun phrase chunk approach. The part-of-speech tag feature dramatically improves the performance of the models, independently of which term selection approach is applied. The highest performance is, however, achieved when the predictions of all three models are combined.
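To make the candidate-term and feature definitions above concrete, here is a small illustrative sketch (not Hulth's system) of the first term selection approach: extracting all uni-, bi-, and trigrams and computing two of the four filtering features, term frequency and relative position of first occurrence. The IDF and part-of-speech features are omitted since they require a background corpus and a tagger; the function name and tokenization choices are assumptions.

# Illustrative sketch: n-gram candidate terms with term frequency and
# relative first-occurrence position, two of the four features above.
# Requires NLTK and its tokenizer data.
from nltk import ngrams, word_tokenize

def candidates_with_features(text):
    tokens = [t.lower() for t in word_tokenize(text) if t.isalnum()]
    feats = {}
    for n in (1, 2, 3):                       # uni-, bi-, and trigrams
        for pos, gram in enumerate(ngrams(tokens, n)):
            term = " ".join(gram)
            tf, first = feats.get(term, (0, pos))
            feats[term] = (tf + 1, min(first, pos))
    total = max(len(tokens), 1)
    return {t: {"tf": tf, "first_rel_pos": first / total}
            for t, (tf, first) in feats.items()}

print(candidates_with_features("Automatic keyword extraction from individual documents."))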
Conference Paper
Full-text available
This paper summarizes algorithms and analysis methodology for mining the evolving content in text streams. Text streams include news, press releases from organizations, speeches, Internet blogs, etc. These data are a fundamental source for detecting and characterizing strategic intent of individuals and organizations as well as for detecting abrupt or surprising events within communities. Specifically, an analyst may need to know if and when the topic within a text stream changes. Much of the current text feature methodology is focused on understanding and analyzing a single static collection of text documents. Corresponding analytic activities include summarizing the contents of the collection, grouping the documents based on similarity of content, and calculating concise summaries of the resulting groups. The approach reported here focuses on taking advantage of the temporal characteristics in a text stream to identify relevant features (such as change in content), and also on the analysis and algorithmic methodology to communicate these characteristics to a user. We present a variety of algorithms for detecting essential features within a text stream. A critical finding is that the characteristics used to identify features in a text stream are uncorrelated with the characteristics used to identify features in a static document collection. Our approach for communicating the information back to the user is to identify feature (word/phrase) groups. These resulting algorithms form the basis of developing software tools for a user to analyze and understand the content of text streams. We present analysis using both news information and abstracts from technical articles, and show how these algorithms provide understanding of the contents of these text streams.
Article
Text streams, collections of documents or messages that are generated and observed over time, are ubiquitous. Our research and development is targeted at developing algorithms to find and characterize changes in topic within text streams. To date, this research has demonstrated the ability to detect and describe 1) short duration, atypical events and 2) the emergence of longer term shifts in topical content. This technology has been applied to pre-defined temporally ordered document collections but is also suitable for application to near real-time textual data streams. The underlying event and emergence detection algorithms have been interfaced to an event detection software user interface named SURPRISE. This software provides an interactive graphical user interface and tools for manipulating and correlating the terms and scores identified by the algorithms. Additionally, SURPRISE has been interfaced with the IN-SPIRE text analytics tool to enable an analyst to evaluate the surprising or emerging terms via a visualization of the entire document collection. IN-SPIRE assists in the exploration of related topics, events and views currently based on single term events. The focus of this research is to contribute to detecting, and preventing, strategic surprise.
Article
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
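The collection-frequency weighting argued for above is what later became known as inverse document frequency. In its now-standard rendering (a common formulation, not the exact expression used in the paper), a term t occurring in n_t of N documents is weighted as

\[ \mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad w_{t,d} = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t), \]

so that matches on rarer, more specific terms contribute more to the document score than matches on frequent terms.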
Article
A stop list, or negative dictionary, is a device used in automatic indexing to filter out words that would make poor index terms. Traditionally stop lists are supposed to have included only the most frequently occurring words. In practice, however, stop lists have tended to include infrequently occurring words, and have not included many frequently occurring words. Infrequently occurring words seem to have been included because stop list compilers have not, for whatever reason, consulted empirical studies of word frequencies. Frequently occurring words seem to have been left out for the same reason, and also because many of them might still be important as index terms. This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.
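The first, purely frequency-based step of the procedure above can be sketched in a few lines; the later manual culling of 32 terms and additions of 26 and 149 words are editorial decisions and are not reproduced. Corpus access via NLTK and the exact tokenization are assumptions, so the resulting candidate count will not match the 278 reported.

# Sketch of the frequency-based first step of stop-list generation:
# collect all words occurring more than 300 times in the Brown corpus.
# Tokenization differs from the paper's, so counts will not match exactly.
from collections import Counter
from nltk.corpus import brown                 # requires nltk.download('brown')

counts = Counter(w.lower() for w in brown.words() if w.isalpha())
candidates = sorted(w for w, c in counts.items() if c > 300)
print(len(candidates), "candidate stop words, e.g.:", candidates[:10])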
This article describes an evaluation of the Kea automatic keyphrase extraction algorithm. Document keyphrases are conventionally used as concise descriptors of document content, and are increasingly used in novel ways, including document clustering, searching and browsing interfaces, and retrieval engines. However, it is costly and time consuming to manually assign keyphrases to documents, motivating the development of tools that automatically perform this function. Previous studies have evaluated Kea's performance by measuring its ability to identify author keywords and keyphrases, but this methodology has a number of well-known limitations. The results presented in this article are based on evaluations by human assessors of the quality and appropriateness of Kea keyphrases. The results indicate that, in general, Kea produces keyphrases that are rated positively by human assessors. However, typical Kea settings can degrade performance, particularly those relating to keyphrase length and domain specificity. We found that for some settings, Kea's performance is better than that of similar systems, and that Kea's ranking of extracted keyphrases is effective. We also determined that author-specified keyphrases appear to exhibit an inherent ranking, and that they are rated highly and therefore suitable for use in training and evaluation of automatic keyphrasing systems.
Article
In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
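One way to make the density notion above concrete (an interpretation for illustration, not a formula quoted from the paper) is to measure the density of the object space as the average pairwise similarity of the n indexed documents, with retrieval performance expected to vary inversely with it:

\[ Q = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \mathrm{sim}(d_i, d_j). \]

Under this reading, an indexing vocabulary is preferred when it yields a lower Q, that is, when documents are spread farther apart in the indexing space.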