Abstract
This paper introduces a novel and domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. We describe the method's configuration parameters and algorithm, and present an evaluation on a benchmark corpus of technical abstracts. We also present a method for generating lists of stop words for specific corpora and domains, and evaluate its ability to improve keyword extraction on the benchmark corpus. Finally, we apply our method of automatic keyword extraction to a corpus of news articles and define metrics for characterizing the exclusivity, essentiality, and generality of extracted keywords within a corpus.
1
Automatic keyword extraction from individual documents

Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley
1.1 Introduction
Keywords, which we define as a sequence of one or more words, provide a
compact representation of a document’s content. Ideally, keywords represent in
condensed form the essential content of a document. Keywords are widely used
to define queries within information retrieval (IR) systems as they are easy to
define, revise, remember, and share. In comparison to mathematical signatures,
keywords are independent of any corpus and can be applied across multiple
corpora and IR systems.
Keywords have also been applied to improve the functionality of IR sys-
tems. Jones and Paynter (2002) describe Phrasier, a system that lists documents
related to a primary document’s keywords, and that supports the use of keyword
anchors as hyperlinks between documents, enabling a user to quickly access
related material. Gutwin et al. (1999) describe Keyphind, which uses keywords
from documents as the basic building block for an IR system. Keywords can also
be used to enrich the presentation of search results. Hulth (2004) describes Kee-
gle, a system that dynamically provides keyword extracts for web pages returned
from a Google search. Andrade and Valencia (1998) present a system that auto-
matically annotates protein function with keywords extracted from the scientific
literature that are associated with a given protein.
1.1.1 Keyword extraction methods
Despite their utility for analysis, indexing, and retrieval, most documents do
not have assigned keywords. Most existing approaches focus on the manual
assignment of keywords by professional curators who may use a fixed taxonomy,
or rely on the authors’ judgment to provide a representative list. Research has
therefore focused on methods to automatically extract keywords from documents
as an aid either to suggest keywords for a professional indexer or to generate
summary features for documents that would otherwise be inaccessible.
Early approaches to automatically extract keywords focus on evaluating
corpus-oriented statistics of individual words. Jones (1972) and Salton et al.
(1975) describe positive results of selecting for an index vocabulary the
statistically discriminating words across a corpus. Later keyword extraction
research applies these metrics to select discriminating words as keywords for
individual documents. For example, Andrade and Valencia (1998) base their
approach on comparison of word frequency distributions within a text against
distributions from a reference corpus.
While some keywords are likely to be evaluated as statistically discriminating
within the corpus, keywords that occur in many documents within the corpus are
not likely to be selected as statistically discriminating. Corpus-oriented methods
also typically operate only on single words. This further limits the measurement of
statistically discriminating words because single words are often used in multiple
and different contexts.
To avoid these drawbacks, we focus our interest on methods of keyword
extraction that operate on individual documents. Such document-oriented
methods will extract the same keywords from a document regardless of the
current state of a corpus. Document-oriented methods therefore provide context-
independent document features, enabling additional analytic methods such as
those described in Engel et al. (2009) and Whitney et al. (2009) that characterize
changes within a text stream over time. These document-oriented methods are
suited to corpora that change, such as collections of published technical abstracts
that grow over time or streams of news articles. Furthermore, by operating on a
single document, these methods inherently scale to vast collections and can be
applied in many contexts to enrich IR systems and analysis tools.
Previous work on document-oriented methods of keyword extraction has combined
natural language processing techniques that identify part-of-speech (POS)
tags with supervised machine-learning algorithms or statistical methods.
Hulth (2003) compares the effectiveness of three term selection approaches:
noun-phrase (NP) chunks, n-grams, and POS tags, with four discriminative fea-
tures of these terms as inputs for automatic keyword extraction using a supervised
machine-learning algorithm.
Mihalcea and Tarau (2004) describe a system that applies a series of syntactic
filters to identify POS tags that are used to select words to evaluate as key-
words. Co-occurrences of the selected words within a fixed-size sliding window
are accumulated within a word co-occurrence graph. A graph-based ranking
algorithm (TextRank) is applied to rank words based on their associations in
the graph, and then top ranking words are selected as keywords. Keywords that
are adjacent in the document are combined to form multi-word keywords. Mihal-
cea and Tarau (2004) report that TextRank achieves its best performance when
only nouns and adjectives are selected as potential keywords.
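For later comparison with RAKE's single-pass scoring (Section 1.3.2), the following is a much-simplified Python sketch of the kind of computation TextRank performs: build an undirected co-occurrence graph over the filtered words and iterate a PageRank-style update until convergence. It omits the syntactic filters and scoring details of Mihalcea and Tarau (2004) and is not their implementation.

```python
from collections import defaultdict

def textrank_style_scores(words, window=2, d=0.85, threshold=0.0001):
    """Rank words by a PageRank-style iteration over an undirected co-occurrence graph.
    `words` is the sequence of words left after filtering (e.g. nouns and adjectives)."""
    neighbors = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                neighbors[words[i]].add(words[j])
                neighbors[words[j]].add(words[i])
    if not neighbors:
        return {}
    scores = {w: 1.0 for w in neighbors}
    while True:
        updated = {w: (1 - d) + d * sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
                   for w in neighbors}
        if max(abs(updated[w] - scores[w]) for w in updated) < threshold:
            return updated            # converged below the threshold
        scores = updated
```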
Matsuo and Ishizuka (2004) apply a chi-square measure to calculate how
selectively words and phrases co-occur within the same sentences as a particular
subset of frequent terms in the document text. The chi-square measure is applied
to determine the bias of word co-occurrences in the document text which is
then used to rank words and phrases as keywords of the document. Matsuo and
Ishizuka (2004) state that the degree of bias is not reliable when term frequency
is small. The authors present an evaluation on full text articles and a working
example on a 27-page document, showing that their method operates effectively
on large documents.
In the following sections, we describe Rapid Automatic Keyword Extrac-
tion (RAKE), an unsupervised, domain-independent, and language-independent
method for extracting keywords from individual documents. We provide details
of the algorithm and its configuration parameters, and present results on a bench-
mark dataset of technical abstracts, showing that RAKE is more computationally
efficient than TextRank while achieving higher precision and comparable recall
scores. We then describe a novel method for generating stoplists, which we use to
configure RAKE for specific domains and corpora. Finally, we apply RAKE to a
corpus of news articles and define metrics for evaluating the exclusivity, essential-
ity, and generality of extracted keywords, enabling a system to identify keywords
that are essential or general to documents in the absence of manual annotations.
1.2 Rapid automatic keyword extraction
In developing RAKE, our motivation has been to develop a keyword extraction
method that is extremely efficient, operates on individual documents to enable
application to dynamic collections, is easily applied to new domains, and operates
well on multiple types of documents, particularly those that do not follow specific
grammar conventions. Figure 1.1 contains the title and text for a typical abstract,
as well as its manually assigned keywords.
RAKE is based on our observation that keywords frequently contain multiple
words but rarely contain standard punctuation or stop words, such as the function
words 'and', 'the', and 'of', or other words with minimal lexical meaning. Reviewing
the manually assigned keywords for the abstract in Figure 1.1, there is only
one keyword that contains a stop word (of in set of natural numbers). Stop
words are typically dropped from indexes within IR systems and not included in
various text analyses as they are considered to be uninformative or meaningless.
This reasoning is based on the expectation that such words are too frequently
and broadly used to aid users in their analyses or search tasks.
Compatibility of systems of linear constraints over the set of natural numbers
Criteria of compatibility of a system of linear Diophantine equations, strict inequations,
and nonstrict inequations are considered. Upper bounds for components of a minimal set
of solutions and algorithms of construction of minimal generating sets of solutions for all
types of systems are given. These criteria and the corresponding algorithms for
constructing a minimal supporting set of solutions can be used in solving all the
considered types of systems and systems of mixed types.
Manually assigned keywords:
linear constraints, set of natural numbers, linear Diophantine equations, strict
inequations, nonstrict inequations, upper bounds, minimal generating sets
Figure 1.1 A sample abstract from the Inspec test set and its manually assigned
keywords.
Words that do carry meaning within a document are described as content bearing and are often
referred to as content words.
The input parameters for RAKE comprise a list of stop words (or stoplist), a
set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and
phrase delimiters to partition the document text into candidate keywords, which
are sequences of content words as they occur in the text. Co-occurrences of words
within these candidate keywords are meaningful and allow us to identify word co-
occurrence without the application of an arbitrarily sized sliding window. Word
associations are thus measured in a manner that automatically adapts to the style
and content of the text, enabling adaptive and fine-grained measurement of word
co-occurrences that will be used to score candidate keywords.
1.2.1 Candidate keywords
RAKE begins keyword extraction on a document by parsing its text into a set of
candidate keywords. First, the document text is split into an array of words by the
specified word delimiters. This array is then split into sequences of contiguous
words at phrase delimiters and stop word positions. Words within a sequence are
assigned the same position in the text and together are considered a candidate
keyword.
Figure 1.2 shows the candidate keywords in the order that they are parsed
from the sample technical abstract shown in Figure 1.1.
Compatibility – systems – linear constraints – set – natural numbers – Criteria –
compatibility – system – linear Diophantine equations – strict inequations – nonstrict
inequations – Upper bounds – components – minimal set – solutions – algorithms –
minimal generating sets – solutions – systems – criteria – corresponding algorithms –
constructing – minimal supporting set – solving – systems – systems
Figure 1.2 Candidate keywords parsed from the sample abstract.
The candidate keyword linear Diophantine equations begins after the stop word of and ends with a
comma. The following word strict begins the next candidate keyword strict
inequations.
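As a concrete illustration of the parsing step just described, the following sketch (in Python, which the chapter itself does not use) splits a text into candidate keywords at phrase delimiters and stop words. The stoplist and delimiter patterns here are toy assumptions for the example, not the generated stoplist the authors use, so only the shown fragment reproduces Figure 1.2 exactly.

```python
import re

# Toy configuration; RAKE takes the stoplist and delimiter sets as input parameters.
STOP_WORDS = {"of", "a", "and", "the", "for", "are", "all", "in", "is",
              "be", "can", "these", "over", "used"}
PHRASE_DELIMITERS = r"[.,;:!?()\u2013-]"   # punctuation that ends a candidate keyword
WORD_DELIMITER = r"\s+"                    # whitespace separates words

def candidate_keywords(text):
    """Split text into candidate keywords: maximal runs of contiguous content words."""
    candidates = []
    for fragment in re.split(PHRASE_DELIMITERS, text):
        phrase = []
        for word in re.split(WORD_DELIMITER, fragment.lower()):
            if not word:
                continue
            if word in STOP_WORDS:         # a stop word ends the current candidate
                if phrase:
                    candidates.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(" ".join(phrase))
    return candidates

print(candidate_keywords("Compatibility of systems of linear constraints "
                         "over the set of natural numbers"))
# ['compatibility', 'systems', 'linear constraints', 'set', 'natural numbers']
```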
1.2.2 Keyword scores
After every candidate keyword is identified and the graph of word co-occurrences
(shown in Figure 1.3) is complete, a score is calculated for each candidate key-
word and defined as the sum of its member word scores. We evaluated several
metrics for calculating word scores, based on the degree and frequency of word
vertices in the graph: (1) word frequency (freq(w)), (2) word degree (deg(w)),
and (3) ratio of degree to frequency (deg(w)/freq(w)).
The metric scores for each of the content words in the sample abstract are
listed in Figure 1.4. In summary, deg(w) favors words that occur often and in
longer candidate keywords; deg(minimal) scores higher than deg(systems). Words
that occur frequently regardless of the number of words with which they co-occur
are favored by freq(w); freq(systems) scores higher than freq(minimal). Words that
predominantly occur in longer candidate keywords are favored by deg(w)/freq(w);
deg(diophantine)/freq(diophantine) scores higher than deg(linear)/freq(linear).
Figure 1.3 The word co-occurrence graph for content words in the sample abstract.
Word            deg(w)   freq(w)   deg(w)/freq(w)
algorithms        3        2         1.5
bounds            2        1         2
compatibility     2        2         1
components        1        1         1
constraints       2        1         2
constructing      1        1         1
corresponding     2        1         2
criteria          2        2         1
diophantine       3        1         3
equations         3        1         3
generating        3        1         3
inequations       4        2         2
linear            5        2         2.5
minimal           8        3         2.7
natural           2        1         2
nonstrict         2        1         2
numbers           2        1         2
set               6        3         2
sets              3        1         3
solving           1        1         1
strict            2        1         2
supporting        3        1         3
system            1        1         1
systems           4        4         1
upper             2        1         2

Figure 1.4 Word scores calculated from the word co-occurrence graph.
minimal generating sets (8.7), linear diophantine equations (8.5), minimal supporting set
(7.7), minimal set (4.7), linear constraints (4.5), natural numbers (4), strict inequations (4),
nonstrict inequations (4), upper bounds (4), corresponding algorithms (3.5), set (2),
algorithms (1.5), compatibility (1), systems (1), criteria (1), system (1), components
(1), constructing (1), solving (1)
Figure 1.5 Candidate keywords and their calculated scores.
The score for each candidate keyword is computed as the sum of its member
word scores. Figure 1.5 lists each candidate keyword from the sample abstract,
using the metric deg(w)/freq(w) to calculate individual word scores.
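The scoring step can be sketched as follows, under the same assumptions as the earlier parsing sketch: freq(w) counts occurrences of a word, deg(w) adds up the lengths of the candidate keywords containing it (the row sums of the co-occurrence graph in Figure 1.3), and each candidate is scored by summing its members' deg(w)/freq(w) values.

```python
from collections import defaultdict

def word_scores(candidates):
    """freq(w) counts occurrences of w; deg(w) adds the length of every candidate
    containing w (the row sums of the co-occurrence graph in Figure 1.3)."""
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in candidates:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            degree[w] += len(words)
    return {w: degree[w] / freq[w] for w in freq}     # the deg(w)/freq(w) metric

def candidate_scores(candidates):
    """Score each candidate keyword as the sum of its member word scores."""
    scores = word_scores(candidates)
    return {phrase: sum(scores[w] for w in phrase.split()) for phrase in set(candidates)}

# A few of the candidates from Figure 1.2:
cands = ["compatibility", "systems", "linear constraints", "set", "natural numbers",
         "linear diophantine equations", "minimal generating sets"]
print(candidate_scores(cands)["linear diophantine equations"])   # 8.5, as in Figure 1.5
```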
1.2.3 Adjoining keywords
Because RAKE splits candidate keywords by stop words, extracted keywords do
not contain interior stop words. While RAKE has generated strong interest due to
its ability to pick out highly specific terminology, an interest was also expressed
in identifying keywords that contain interior stop words, such as axis of evil. To
find these, RAKE looks for pairs of keywords that adjoin one another at least
twice in the same document and in the same order. A new candidate keyword is
then created as a combination of those keywords and their interior stop words.
The score for the new keyword is the sum of its member keyword scores.
It should be noted that relatively few of these linked keywords are extracted,
which adds to their significance. Because adjoining keywords must occur twice
in the same order within the document, their extraction is more common on texts
that are longer than short abstracts.
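A simplified sketch of this adjoining-keyword step follows. It links pairs of extracted keywords that appear in the text separated only by stop words, in the same order, at least twice; the helper names and the toy example are illustrative, and the adjacency test is only what the paragraph above states.

```python
import re
from collections import Counter

def adjoining_keywords(text, keywords, stop_words):
    """Link keyword pairs that adjoin one another (separated only by interior stop
    words) at least twice in the same order; the new keyword keeps the stop words."""
    tokens = re.findall(r"[\w'-]+", text.lower())
    keyword_tokens = {k: k.split() for k in keywords}
    linked = Counter()
    for i in range(len(tokens)):
        for a in keyword_tokens.values():
            if tokens[i:i + len(a)] != a:
                continue
            j = i + len(a)
            interior = []
            while j < len(tokens) and tokens[j] in stop_words:
                interior.append(tokens[j])
                j += 1
            for b in keyword_tokens.values():
                if interior and tokens[j:j + len(b)] == b:
                    linked[" ".join(a + interior + b)] += 1
    return [k for k, count in linked.items() if count >= 2]

text = ("opposition to the axis of evil speech grew, "
        "but reaction to the axis of evil speech was mixed")
print(adjoining_keywords(text, ["axis", "evil"], {"of", "the", "to"}))   # ['axis of evil']
```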
1.2.4 Extracted keywords
After candidate keywords are scored, the top T scoring candidates are selected
as keywords for the document. We compute T as one-third the number of words
in the graph, as in Mihalcea and Tarau (2004).
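The selection step itself is small: take the top T candidates by score, with T set to one-third of the number of words in the co-occurrence graph. A minimal sketch, where scores_by_candidate is assumed to be the output of a scoring routine such as the one sketched earlier:

```python
def top_keywords(scores_by_candidate, graph_word_count):
    """Return the top T scoring candidates, T = one-third of the words in the graph."""
    T = graph_word_count // 3
    ranked = sorted(scores_by_candidate.items(), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:T]]

# For the sample abstract: 28 content words gives T = 28 // 3 = 9 keywords.
```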
The sample abstract contains 28 content words, resulting in T = 9 key-
words. Table 1.1 lists the keywords extracted by RAKE compared to the sample
abstract’s manually assigned keywords. We use the statistical measures precision,
recall and F-measure to evaluate the accuracy of RAKE. Out of nine keywords
extracted, six are true positives; that is, they exactly match six of the manually
assigned keywords.
Table 1.1 Comparison of keywords extracted by RAKE to
manually assigned keywords for the sample abstract.
Extracted by RAKE              Manually assigned
minimal generating sets        minimal generating sets
linear diophantine equations   linear Diophantine equations
minimal supporting set         —
minimal set                    —
linear constraints             linear constraints
natural numbers                —
strict inequations             strict inequations
nonstrict inequations          nonstrict inequations
upper bounds                   upper bounds
—                              set of natural numbers
Although natural numbers is similar to the assigned keyword set of natural numbers, for the purposes of the benchmark evaluation
it is considered a miss. There are therefore three false positives in the set of
extracted keywords, resulting in a precision of 67%. Comparing the six true
positives within the set of extracted keywords to the total of seven manually
assigned keywords results in a recall of 86%. Equally weighting precision and
recall generates an F-measure of 75%.
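The evaluation arithmetic above can be made explicit with a small sketch; the values for the sample abstract (six exact matches out of nine extracted and seven manually assigned keywords) reproduce the reported 67%, 86%, and 75%.

```python
def precision_recall_f(correct, extracted, assigned):
    precision = correct / extracted
    recall = correct / assigned
    f_measure = 2 * precision * recall / (precision + recall)   # equal weighting
    return precision, recall, f_measure

# Sample abstract: 6 exact matches, 9 extracted keywords, 7 manually assigned keywords.
print(precision_recall_f(6, 9, 7))   # approximately (0.67, 0.86, 0.75)
```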
1.3 Benchmark evaluation
To evaluate performance we tested RAKE against a collection of technical
abstracts used in the keyword extraction experiments reported in Hulth (2003)
and Mihalcea and Tarau (2004), mainly for the purpose of allowing direct
comparison with their results.
1.3.1 Evaluating precision and recall
The collection consists of 2000 Inspec abstracts for journal papers from Computer
Science and Information Technology. The abstracts are divided into a training
set with 1000 abstracts, a validation set with 500 abstracts, and a testing set with
500 abstracts. We followed the approach described in Mihalcea and Tarau (2004),
using the testing set for evaluation because RAKE does not require a training
set. Extracted keywords for each abstract are compared against the abstract’s
associated set of manually assigned uncontrolled keywords.
Table 1.2 details RAKE’s performance using a generated stoplist, Fox’s sto-
plist (Fox 1989), and T as one-third the number of words in the graph. For
each method, which corresponds to a row in the table, the following information
is shown: the total number of extracted keywords and mean per abstract; the
number of correct extracted keywords and mean per abstract; precision; recall;
and F-measure.
Table 1.2 Results of automatic keyword extraction on 500 abstracts in the
Inspec test set using RAKE, TextRank (Mihalcea and Tarau 2004) and
supervised learning (Hulth 2003).
Method                              Extracted keywords    Correct keywords
                                    Total      Mean       Total     Mean     Precision   Recall   F-measure
RAKE (T = 0.33)
  KA stoplist (df > 10)             6052       12.1       2037      4.1      33.7        41.5     37.2
  Fox stoplist                      7893       15.8       2054      4.2      26.0        42.2     32.1
TextRank
  Undirected, co-occ. window = 2    6784       13.6       2116      4.2      31.2        43.1     36.2
  Undirected, co-occ. window = 3    6715       13.4       1897      3.8      28.2        38.6     32.6
Hulth (2003)
  Ngram with tag                    7815       15.6       1973      3.9      25.2        51.7     33.9
  NP chunks with tag                4788        9.6       1421      2.8      29.7        37.2     33.0
  Pattern with tag                  7012       14.0       1523      3.0      21.7        39.9     28.1
the, and, of, a, in, is, for, to, we, this, are, with, as, on, it, an, that, which, by, using, can,
paper, from, be, based, has, was, have, or, at, such, also, but, results, proposed, show,
new, these, used, however, our, were, when, one, not, two, study, present, its, sub, both,
then, been, they, all, presented, if, each, approach, where, may, some, more, use,
between, into, 1, under, while, over, many, through, addition, well, first, will, there,
propose, than, their, 2, most, sup, developed, particular, provides, including, other, how,
without, during, article, application, only, called, what, since, order, experimental, any
Figure 1.6 Top 100 words in the generated stoplist.
Results published within Hulth (2003) and Mihalcea and Tarau (2004) are included for comparison. The highest values for precision, recall, and
F-measure are shown in bold. As noted, perfect precision is not possible with
any of the techniques as the manually assigned keywords do not always appear
in the abstract text. The highest precision and F-measure are achieved using
RAKE with a generated stoplist based on keyword adjacency, a subset of which
is listed in Figure 1.6. With this stoplist RAKE yields the best results in terms of
F-measure and precision, and provides comparable recall. With Fox’s stoplist,
RAKE achieves a high recall while experiencing a drop in precision.
1.3.2 Evaluating efficiency
Because of increasing interest in energy conservation in large data centers, we
also evaluated the computational cost associated with extracting keywords with
RAKE and TextRank. TextRank applies syntactic filters to a document text to
identify content words and accumulates a graph of word co-occurrences in a
window size of 2. A rank for each word in the graph is calculated through a
series of iterations until convergence below a threshold is achieved.
We set TextRank's damping factor d = 0.85 and its convergence threshold to
0.0001, as recommended in Mihalcea and Tarau (2004). We do not have access
to the syntactic filters referenced in Mihalcea and Tarau (2004), so were unable
to evaluate their computational cost.
To minimize disparity, all parsing stages in the respective extraction methods
are identical, TextRank accumulates co-occurrences in a window of size 2, and
RAKE accumulates word co-occurrences within candidate keywords. After co-
occurrences are tallied, the algorithms compute keyword scores according to their
respective methods. The benchmark was implemented in Java and executed in the
Java SE Runtime Environment (JRE) 6 on a Dell Precision T7400 workstation.
We calculated the total time for RAKE and TextRank (as an average over 100
iterations) to extract keywords from the Inspec testing set of 500 abstracts, after
the abstracts were read from files and loaded in memory. RAKE extracted key-
words from the 500 abstracts in 160 milliseconds. TextRank extracted keywords
in 1002 milliseconds, more than six times as long as RAKE.
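The published benchmark was implemented in Java; the following Python sketch only illustrates the shape of such a timing comparison. The rake_extract and textrank_extract callables are assumed to be available implementations and are not defined here.

```python
import time

def average_extraction_time(extract, documents, runs=100):
    """Average wall-clock time to extract keywords from all documents over several runs.
    Documents are assumed to be already loaded in memory, as in the benchmark."""
    start = time.perf_counter()
    for _ in range(runs):
        for doc in documents:
            extract(doc)
    return (time.perf_counter() - start) / runs

# e.g. average_extraction_time(rake_extract, abstracts)
#      average_extraction_time(textrank_extract, abstracts)
```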
Referring to Figure 1.7, we can see that as the number of content words
for a document increases, the performance advantage of RAKE over TextRank
increases. This is due to RAKE’s ability to score keywords in a single pass
whereas TextRank requires repeated iterations to achieve convergence on
word ranks.
Based on this benchmark evaluation, it is clear that RAKE effectively extracts
keywords and outperforms the current state of the art in terms of precision, effi-
ciency, and simplicity. As RAKE can be put to use in many different systems and
applications, in the next section we discuss a method for stoplist generation that
may be used to configure RAKE on particular corpora, domains, and languages.
1.4 Stoplist generation
Stoplists are widely used in IR and text analysis applications. However, there is
remarkably little information describing methods for their creation. Fox (1989)
presents an analysis of stoplists, noting discrepancies between stated conven-
tions and actual instances and implementations of stoplists. The lack of tech-
nical rigor associated with the creation of stoplists presents a challenge when
comparing text analysis methods. In practice, stoplists are often based on com-
mon function words and hand-tuned for particular applications, domains, or
specific languages.
We evaluated the use of term frequency as a metric for automatically selecting
words for a stoplist. Table 1.3 lists the top 50 words by term frequency in the
training set of abstracts in the benchmark dataset. Additional metrics shown for
each word are document frequency, adjacency frequency, and keyword frequency.
Figure 1.7 Extraction time by document size: comparison of TextRank and RAKE extraction times (in milliseconds) on individual documents against the number of vertices in the word co-occurrence graph.
Adjacency frequency reflects the number of times the word occurred adjacent to an abstract's keywords. Keyword frequency reflects the number of times the word
occurred within an abstract’s keywords.
Looking at the top 50 frequent words, in addition to the typical function
words, we can see that system, control, and method are highly frequent within
technical abstracts and highly frequent within the abstracts’ keywords. Selecting
solely by term frequency will therefore cause content-bearing words to be added
to the stoplist, particularly if the corpus of documents is focused on a particular
domain or topic. In those circumstances, selecting stop words by term frequency
presents a risk of removing important content-bearing words from analysis.
We therefore present the following method for automatically generating a
stoplist from a set of documents for which keywords are defined. The algorithm
is based on the intuition that words adjacent to, and not within, keywords are
less likely to be meaningful and therefore are good choices for stop words.
To generate our stoplist we identified for each abstract in the Inspec training
set the words occurring adjacent to words in the abstract’s uncontrolled key-
word list. The frequency of each word occurring adjacent to a keyword was
accumulated across the abstracts. Words that occurred more frequently within
keywords than adjacent to them were excluded from the stoplist.
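A sketch of the keyword adjacency (KA) stoplist generation just described: for each training document with assigned keywords, count how often each word occurs inside a keyword and how often it occurs immediately adjacent to one, then keep frequent words that occur adjacent to keywords at least as often as within them. The tokenization and the document-frequency cutoff (mirroring the df > 10 setting of Tables 1.2 and 1.4) are illustrative assumptions.

```python
import re
from collections import Counter

def ka_stoplist(training_docs, min_df=10):
    """training_docs: iterable of (text, assigned_keywords) pairs.
    Returns a keyword adjacency (KA) stoplist as a set of words."""
    adjacency_freq, keyword_freq, doc_freq = Counter(), Counter(), Counter()
    for text, keywords in training_docs:
        tokens = re.findall(r"[\w-]+", text.lower())
        inside = set()                                   # token positions inside a keyword
        for kw in keywords:
            kw_tokens = kw.lower().split()
            for i in range(len(tokens) - len(kw_tokens) + 1):
                if tokens[i:i + len(kw_tokens)] == kw_tokens:
                    inside.update(range(i, i + len(kw_tokens)))
        for i, tok in enumerate(tokens):
            if i in inside:
                keyword_freq[tok] += 1                   # occurred within a keyword
            elif (i - 1 in inside) or (i + 1 in inside):
                adjacency_freq[tok] += 1                 # occurred adjacent to a keyword
        for tok in set(tokens):
            doc_freq[tok] += 1
    return {w for w, df in doc_freq.items()
            if df > min_df and keyword_freq[w] <= adjacency_freq[w]}
```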
Table 1.3 The 50 most frequent words in the Inspec training set listed in
descending order by term frequency.
Word          Term frequency   Document frequency   Adjacency frequency   Keyword frequency
the 8611 978 3492 3
of 5546 939 1546 68
and 3644 911 2104 23
a 3599 893 1451 2
to 3000 879 792 10
in 2656 837 1402 7
is 1974 757 1175 0
for 1912 767 951 9
that 1129 590 330 0
with 1065 577 535 3
are 1049 576 555 1
this 964 581 645 0
on 919 550 340 8
an 856 501 332 0
we 822 388 731 0
by 773 475 283 0
as 743 435 344 0
be 595 395 170 0
it 560 369 339 13
system 507 255 86 202
can 452 319 250 0
based 451 293 168 15
from 447 309 187 0
using 428 282 260 0
control 409 166 12 237
which 402 280 285 0
paper 398 339 196 1
systems 384 194 44 191
method 347 188 78 85
data 347 159 39 131
time 345 201 24 95
model 343 157 37 122
information 322 153 18 151
or 315 218 146 0
s 314 196 27 0
have 301 219 149 0
has 297 225 166 0
at 296 216 141 0
new 294 197 93 4
two 287 205 83 5
algorithm 267 123 36 96
results 262 221 129 14
used 262 204 92 0
was 254 125 161 0
these 252 200 93 0
also 251 219 139 0
such 249 198 140 0
problem 234 137 36 55
design 225 110 38 68
To evaluate this method of generating stoplists, we created six stoplists, three
of which select words for the stoplist by term frequency (TF), and three which
select words by term frequency but also exclude words from the stoplist whose
keyword frequency was greater than their keyword adjacency frequency. We
refer to this latter set of stoplists as keyword adjacency (KA) stoplists since they
primarily include words that are adjacent to and not within keywords.
Table 1.4 Comparison of RAKE performance using stoplists based on term
frequency (TF) and keyword adjacency (KA).
Method                     Stoplist size   Extracted keywords    Correct keywords
                                           Total      Mean       Total     Mean     Precision   Recall   F-measure
RAKE (T = 0.33)
  TF stoplist (df > 10)    1347            3670        7.3        606      1.2      16.5        12.3     14.1
  TF stoplist (df > 25)     527            5563       11.1       1032      2.1      18.6        21.0     19.7
  TF stoplist (df > 50)     205            7249       14.5       1520      3.0      21.0        30.9     25.0
RAKE (T = 0.33)
  KA stoplist (df > 10)     763            6052       12.1       2037      4.1      33.7        41.5     37.2
  KA stoplist (df > 25)     325            7079       14.2       2103      4.3      29.7        42.8     35.1
  KA stoplist (df > 50)     147            8013       16.0       2117      4.3      26.4        43.1     32.8
Each of the stoplists was set as the input stoplist for RAKE, which was
then run on the testing set of the Inspec corpus of technical abstracts. Table 1.4
lists the precision, recall, and F-measure for the keywords extracted by each
of these runs. The KA stoplists generated by our method outperformed the
TF stoplists generated by term frequency. A notable difference between results
achieved using the two types of stoplists is evident in Table 1.4: the F-measure
improves as more words are added to a KA stoplist, whereas when more words are
added to a TF stoplist the F-measure degrades. Furthermore, the best TF stoplist
underperforms the worst KA stoplist. This verifies that our algorithm for gener-
ating stoplists is adding the right stop words and excluding content words from
the stoplist.
Because the generated KA stoplists leverage manually assigned keywords, we
envision that an ideal application would be within existing digital libraries or IR
systems and collections where defined keywords exist or are easily identified for
a subset of the documents. Stoplists only need to be generated once for particular
domains, enabling RAKE to be applied to new and future articles, facilitating
the annotation and indexing of new documents.
1.5 Evaluation on news articles
While we have shown that a simple set of configuration parameters enables
RAKE to efficiently extract keywords from individual documents, it is worth
investigating how well extracted keywords represent the essential content within
a corpus of documents for which keywords have not been manually assigned.
The following section presents results on application of RAKE to the Multi-
Perspective Question Answering (MPQA) Corpus (CERATOPS 2009).
1.5.1 The MPQA Corpus
The MPQA Corpus consists of 535 news articles provided by the Center for the
Extraction and Summarization of Events and Opinions in Text (CERATOPS).
Articles in the MPQA Corpus are from 187 different foreign and US news sources
and date from June 2001 to May 2002.
1.5.2 Extracting keywords from news articles
We extracted keywords from title and text fields of documents in the MPQA
Corpus and set a minimum document threshold of two because we are interested
in keywords that are associated with multiple documents.
Candidate keyword scores were based on word scores as deg(w)/freq(w)
and as deg(w). Calculating word scores as deg(w)/freq(w), RAKE extracted 517
keywords referenced by an average of 4.9 documents. Calculating word scores
as deg(w), RAKE extracted 711 keywords referenced by an average of 8.1
documents.
This difference in average number of referenced document counts is the
result of longer keywords having lower frequency across documents. The metric
deg(w)/freq(w) favors longer keywords and therefore results in extracted key-
words that occur in fewer documents in the MPQA Corpus.
In many cases a subject is occasionally presented in its long form and more
frequently referenced in its shorter form. For example, referring to Table 1.5,
kyoto protocol on climate change and 1997 kyoto protocol occur less frequently
than the shorter kyoto protocol. Because our interest in the analysis of news
articles is to connect articles that reference related content, we set RAKE to
score words by deg(w) in order to favor shorter keywords that occur across more
documents.
Because most documents are unique within any given corpus, we expect to
find variability in what documents are essentially about as well as how each
document represents specific subjects. While some documents may be primarily
about the kyoto protocol, greenhouse gas emissions, and climate change, other
documents may only make references to those subjects. Documents in the former
set will likely have kyoto protocol, greenhouse gas emissions, and climate change
extracted as keywords whereas documents in the latter set will not.
In many applications, users have a desire to capture all references to extracted
keywords.
Table 1.5 Keywords extracted with word scores by deg(w) and deg(w)/freq(w).

                                                     Scored by deg(w)    Scored by deg(w)/freq(w)
Keyword                                              edf(k)   rdf(k)     edf(k)   rdf(k)
kyoto protocol legally obliged developed countries   2        2          2        2
eu leader urge russia to ratify kyoto protocol       2        2          2        2
kyoto protocol on climate change                     2        2          2        2
ratify kyoto protocol                                2        2          2        2
kyoto protocol requires                              2        2          2        2
1997 kyoto protocol                                  2        4          4        4
kyoto protocol                                       31       44         7        44
kyoto                                                10       12         —        —
kyoto accord                                         3        3          —        —
kyoto pact                                           2        3          —        —
sign kyoto protocol                                  2        2          —        —
ratification of the kyoto protocol                   —        —          2        2
ratify the kyoto protocol                            —        —          2        2
kyoto agreement                                      —        —          2        2
For the purposes of evaluating extracted keywords, we accumulate counts on how often each extracted keyword is referenced by documents in the
corpus. The referenced document frequency of a keyword, rdf(k), is the number of
documents in which the keyword occurred as a candidate keyword. The extracted
document frequency of a keyword, edf(k), is the number of documents from which
the keyword was extracted.
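A sketch of how rdf(k) and edf(k) can be accumulated over a corpus, assuming each document is represented by its set of candidate keywords and its set of extracted keywords (this data layout is an assumption of the sketch):

```python
from collections import Counter

def document_frequencies(per_document):
    """per_document: iterable of (candidate_keywords, extracted_keywords) pairs,
    one pair per document. Returns the rdf and edf counters."""
    rdf, edf = Counter(), Counter()
    for candidates, extracted in per_document:
        for k in set(candidates):
            rdf[k] += 1        # the keyword occurred as a candidate in this document
        for k in set(extracted):
            edf[k] += 1        # the keyword was extracted from this document
    return rdf, edf
```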
A keyword that is extracted from all of the documents in which it is refer-
enced can be characterized as exclusive or essential, whereas a keyword that is
referenced in many documents but extracted from a few may be characterized as
general. Comparing the relationship of edf(k) and rdf(k) allows us to characterize
the exclusivity of a particular keyword. We therefore define keyword exclusivity
exc(k) as shown in Equation (1.1):

exc(k) = edf(k) / rdf(k). (1.1)
Of the 711 extracted keywords, 395 have an exclusivity score of 1, indicating
that they were extracted from every document in which they were referenced.
Within that set of 395 exclusive keywords, some occur in more documents than
others and can therefore be considered more essential to the corpus of documents.
In order to measure how essential a keyword is, we define the essentiality of a
keyword, ess(k), as shown in Equation (1.2):
ess(k) = exc(k) × edf(k). (1.2)
Figure 1.8 lists the top 50 essential keywords extracted from the MPQA cor-
pus, listed in descending order by their ess(k) scores. According to CERATOPS,
the MPQA corpus comprises 10 primary topics, listed in Table 1.6, which are
well represented by the 50 most essential keywords as extracted and ranked by
RAKE.
In addition to keywords that are essential to documents, we can also char-
acterize keywords by how general they are to the corpus.
united states (32), human rights (24), kyoto protocol (22), international space station (18),
mugabe (16), space station (14), human rights report (12), greenhouse gas emissions
(12), chavez (11), taiwan issue (11), president chavez (10), human rights violations (10),
president bush (10), palestinian people (10), prisoners of war (9), president hugo chavez
(9), kyoto (8), taiwan (8), israeli government (8), hugo chavez (8), climate change (8),
space (8), axis of evil (7), president fernando henrique cardoso (7), palestinian (7),
palestinian territories (6), taiwan strait (6), russian news agency interfax (6), prisoners (6),
taiwan relations act (6), president robert mugabe (6), presidential election (6), geneva
convention (5), palestinian authority (5), venezuelan president hugo chavez (5), chinese
president jiang zemin (5), opposition leader morgan tsvangirai (5), french news agency
afp (5), bush (5), north korea (5), camp x-ray (5), rights (5), election (5), mainland china
(5), al qaeda (5), president (4), south africa (4), global warming (4), bush administration
(4), mdc leader (4)
Figure 1.8 Top 50 essential keywords from the MPQA Corpus, with correspond-
ing ess(k) score in parentheses.
Table 1.6 MPQA Corpus topics and definitions.
Topic Description
argentina Economic collapse in Argentina
axisofevil Reaction to President Bush’s 2002 State of the Union Address
guantanamo US holding prisoners in Guantanamo Bay
humanrights Reaction to US State Department report on human rights
kyoto Ratification of Kyoto Protocol
mugabe 2002 Presidential election in Zimbabwe
settlements Israeli settlements in Gaza and West Bank
spacestation Space missions of various countries
taiwan Relations between Taiwan and China
venezuela Presidential coup in Venezuela
government (147), countries (141), people (125), world (105), report (91), war (85), united
states (79), china (71), president (69), iran (60), bush (56), japan (50), law (44), peace
(44), policy (43), officials (43), israel (41), zimbabwe (39), taliban (36), prisoners (35),
opposition (35), plan (35), president george (34), axis (34), administration (33), detainees
(32), treatment (32), states (30), european union (30), palestinians (30), election (29),
rights (28), international community (27), military (27), argentina (27), america (27),
guantanamo bay (26), official (26), weapons (24), source (24), eu (23), attacks (23),
united nations (22), middle east (22), bush administration (22), human rights (21), base
(20), minister (20), party (19), north korea (18)
Figure 1.9 Top 50 general keywords from the MPQA Corpus, with corresponding
gen(k) score in parentheses.
In other words, how often was a keyword referenced by documents from which it was not extracted?
In this case we define generality of a keyword, gen(k), as shown in Equation
(1.3):
gen(k) = rdf(k) × (1.0 − exc(k)). (1.3)
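Equations (1.1)–(1.3) can be computed directly from the edf and rdf counts; a minimal sketch:

```python
def keyword_metrics(edf, rdf):
    """Exclusivity, essentiality and generality (Equations (1.1)-(1.3)) per keyword."""
    metrics = {}
    for k, referenced in rdf.items():
        exc = edf.get(k, 0) / referenced                 # Equation (1.1)
        metrics[k] = {"exc": exc,
                      "ess": exc * edf.get(k, 0),        # Equation (1.2)
                      "gen": referenced * (1.0 - exc)}   # Equation (1.3)
    return metrics

# kyoto protocol in Table 1.5: edf = 31, rdf = 44, so exc is about 0.70,
# ess is about 22 (cf. Figure 1.8) and gen = 13.
```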
Figure 1.9 lists the top 50 general keywords extracted from the MPQA corpus,
listed in descending order by their gen(k) scores. It should be noted that general
keywords and essential keywords are not mutually exclusive. Within the top 50
for both metrics, there are several shared keywords: united states, president,
bush, prisoners, election, rights, bush administration, human rights, and north
korea. Keywords that are both highly essential and highly general are essential
to a set of documents within the corpus but also referenced by a significantly
greater number of documents within the corpus than other keywords.
1.6 Summary
We have shown that our automatic keyword extraction technology, RAKE,
achieves higher precision and similar recall in comparison to existing techniques.
In contrast to methods that depend on natural language processing techniques
to achieve their results, RAKE takes a simple set of input parameters and
automatically extracts keywords in a single pass, making it suitable for a wide
range of documents and collections.
Finally, RAKE’s simplicity and efficiency enable its use in many applications
where keywords can be leveraged. Based on the variety and volume of existing
collections and the rate at which documents are created and collected, RAKE
provides advantages and frees computing resources for other analytic methods.
1.7 Acknowledgements
This work was supported by the National Visualization and Analytics Center
(NVAC), which is sponsored by the US Department of Homeland Security
Program and located at the Pacific Northwest National Laboratory (PNNL), and
by Laboratory Directed Research and Development at PNNL. PNNL is managed
for the US Department of Energy by Battelle Memorial Institute under Contract
DE-AC05-76RL01830.
We also thank Anette Hulth, for making available the dataset used in her
experiments.
References
Andrade M and Valencia A 1998 Automatic extraction of keywords from scientific
text: application to the knowledge domain of protein families. Bioinformatics 14(7),
600–607.
CERATOPS 2009 MPQA Corpus http://www.cs.pitt.edu/mpqa/ceratops/corpora.html.
Engel D, Whitney P, Calapristi A and Brockman F 2009 Mining for emerging technolo-
gies within text streams and documents. Proceedings of the Ninth SIAM International
Conference on Data Mining. Society for Industrial and Applied Mathematics.
Fox C 1989 A stop list for general text. ACM SIGIR Forum, vol. 24, pp. 19–21. ACM,
New York, USA.
Gutwin C, Paynter G, Witten I, Nevill-Manning C and Frank E 1999 Improving browsing
in digital libraries with keyphrase indexes. Decision Support Systems 27(1–2), 81–104.
Hulth A 2003 Improved automatic keyword extraction given more linguistic knowledge.
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Pro-
cessing, vol. 10, pp. 216–223. Association for Computational Linguistics, Morristown,
NJ, USA.
Hulth A 2004 Combining machine learning and natural language processing for automatic
keyword extraction. Stockholm University, Faculty of Social Sciences, Department of
Computer and Systems Sciences (together with KTH).
Jones K 1972 A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation 28(1), 11–21.
Jones S and Paynter G 2002 Automatic extraction of document keyphrases for use in
digital libraries: evaluation and applications. Journal of the American Society for Infor-
mation Science and Technology.
Matsuo Y and Ishizuka M 2004 Keyword extraction from a single document using word
co-occurrence statistical information. International Journal on Artificial Intelligence
Tools 13(1), 157–169.
Mihalcea R and Tarau P 2004 Textrank: Bringing order into texts. In Proceedings of
EMNLP 2004 (ed. Lin D and Wu D), pp. 404–411. Association for Computational
Linguistics, Barcelona, Spain.
Salton G, Wong A and Yang C 1975 A vector space model for automatic indexing.
Communications of the ACM 18(11), 613–620.
Whitney P, Engel D and Cramer N 2009 Mining for surprise events within text streams.
Proceedings of the Ninth SIAM International Conference on Data Mining, pp. 617–627.
Society for Industrial and Applied Mathematics.
... Therefore, it is necessary to assign key phrases contained in a document to other documents as labels. Typical keyword extraction methods [1]- [3] cannot be used when attempting to achieve this goal because they extract keywords from the words in a document; thus, it is impossible to assign keywords that are not included in the document as labels. Named entity extraction [4], [5] cannot be used because it extracts key phrases from a document, whether it is supervised or unsupervised. ...
... TOT outputs θ, φ, and ψ for {Δ P i , T P i }. For an input set family, we obtain the output set family ({θ 1 ...
Article
The Topics over Time (TOT) model allows users to be aware of changes in certain topics over time. The proposed method inputs the divided dataset of security blog posts based on a fixed period using an overlap period to the TOT. The results suggest the extraction of topics that include malware and attack campaign names that are appropriate for the multi-labeling of cyber threat intelligence reports.
... La première approche, et la plus utilisée, concerne la suppression en rejetant les mots apparaissant dans une liste compilée de mots vides dépendant de la langue. Certains schémas (Lo et al., 2005;Rose et al., 2010) sont également proposés pour la génération automatique de listes de mots vides. On peut également personnaliser une liste de mots vides en fonction des mots qui peuvent être intéressants pour une tâche souhaitée. ...
Thesis
Au cours de ces dernières années, l’information au sens large est devenue la pièce maîtresse pour révolutionner les projets de transformation numérique. Encore faut-il savoir l’exploiter d’une manière intelligente pour en tirer tous les bénéfices. L’informatisation des données textuelles concerne plusieurs secteurs d’activité, en particulier le domaine médical. Aujourd’hui, la médecine moderne est devenue presque inconcevable sans l’utilisation des données numériques, qui ont fortement affecté la compréhension scientifique des maladies. Par ailleurs, ces dernières années, les données médicales sont devenues de plus en plus complexes en raison de leur croissance exponentielle. Cette forte croissance engendre une quantité de données importante qui ne permet pas d’effectuer une lecture humaine complète dans un délai raisonnable. Ainsi, les professionnels de santé reconnaissent l’importance des outils informatiques pour identifier des modèles informatifs ou prédictifs à travers le traitement et l’analyse automatiques des données médicales. Notre thèse s’inscrit dans le cadre du projet ConSoRe, et vise à créer des cohortes de patients résistants aux traitements anticancéreux. L’identification de ces résistances nous permet de mettre en place des modèles de prédiction des éventuels risques qui pourraient apparaître pendant le traitement des patients, et nous facilite l’individualisation et le renforcement de la prévention en fonction du niveau de risque estimé. Cette démarche s’inscrit dans le cadre d’une médecine de précision, permettant de proposer de nouvelles solutions thérapeutiques adaptées à la fois aux caractéristiques de la maladie (cancer) et aux profils des patients identifiés. Pour répondre à ces problématiques, nous présentons dans ce manuscrit nos différentes contributions. Notre première contribution consiste en une approche séquentielle permettant de traiter les différents problèmes liés au pré-traitement et à la préparation des données textuelles. La complexité de ces tâches réside essentiellement dans la qualité et la nature de ces textes, et est liée étroitement aux particularités des comptes rendus médicaux traités. Outre les opérations de linguistiques standards telles que la tokenisation ou la segmentation en phrases, nous présentons un arsenal de techniques assez large pour la préparation et le nettoyage des données. Notre deuxième contribution consiste en une approche de classification automatique des phrases extraites des comptes rendus médicaux. Cette approche est constituée essentiellement de deux étapes. La première consiste à entraîner les vecteurs de mots pour représenter les textes de façon à extraire le plus de caractéristiques possibles. La seconde étape est une classification automatique de phrases selon leurs informations sémantiques. Nous étudions pour cela les différents algorithmes d’apprentissage automatique (classique et profond) qui fournissent les meilleures performances sur nos données, et nous présentons notre meilleur algorithme. Notre troisième et dernière contribution majeure est consacrée à notre approche de modélisation des résistances aux traitements d’oncologie. Pour cela, nous présentons deux modèles de structuration des données. Le premier modèle nous permet de structurer les informations identifiées au niveau de chaque document (ou compte rendu). Le second modèle est quant à lui introduit au niveau patient, et permet à partir des informations extraites dans plusieurs comptes rendus d’un même patient, reconstruire son parcours néoplasique. 
Cette structuration permet d’identifier les réponses aux traitements et les toxicités, qui constituent des composants élémentaires pour notre approche de modélisation des résistances aux traitements d’oncologie.
... These words are given a higher priority to be the part of summary. RAKE [Rapid Automatic Keyword Extraction][22] is used to extract these words from the corpus. 6. Proper noun: This feature gives the presence of proper noun in the sentence. ...
Article
Full-text available
In today’s world as the data on the web is increasing it becomes a challenge to identify the relevant information. Automatic text summarization (ATS) provides a significant answer to it. In this paper, fuzzy logic and shark smell optimization (SSO) based algorithm for extractive text summarization is proposed. Shark Smell Optimization has been used to assign a weight to eight different features to identify the less and more important text features of text summarization. Then, the Fuzzy Logic’s inference system is utilized to generate fuzzy rules, and finally an automated summary is generated. The system generated summaries have been tested against the reference summaries from the DUC 2002, DUC 2003, DUC-2004 and TAC-11 dataset and ROUGE toolkit has been used for the evaluation of the proposed solution. Results of the proposed algorithm are compared against traditional methods and the rouge score suggested that the proposed algorithm generates better results than other methods.
... One of them is Rapid Automatic Keyword extraction (RAKe) algorithm. RAKe is an algorithm to automatically extract keywords from documents [10]. ...
Article
The given work considers the existing methods of text compression (finding keywords or creating summary) using RAKE, Lex Rank, Luhn, LSA, Text Rank algorithms; image generation; text-to-image and image-to-image translation including GANs (generative adversarial networks). Different types of GANs were described such as StyleGAN, GauGAN, Pix2Pix, CycleGAN, BigGAN, AttnGAN. This work aims to show ways to create illustrations for the text. First, key information should be obtained from the text. Second, this key information should be transformed into images. There were proposed several ways to transform keywords to images: generating images or selecting them from a dataset with further transforming like generating new images based on selected ow combining selected images e.g. with applying style from one image to another. Based on results, possibilities for further improving the quality of image generation were also planned: combining image generation with selecting images from a dataset, limiting topics of image generation.
... Another promising study by Zhao [68] provides targeted feedback based on misconceptions in online learning environments using key phrase extraction and semantic embedding using different stateof-the-art Natural Language Process (NLP) techniques. Three types of text segmentation were used to extract key points: statistics-based key phrase extraction, graph-based key phrase extraction such as RAKE (Rapid Automatic Keyword Extraction [55]), and semantic-based key phrase extraction based on pre-trained language BERT (Bidirectional Encoder Representations from Transformers) models. Although, not directly develop for assessment of analogical reasoning, "The Retriever" tool develop by [21] for creative idea generation based on analogical reasoning and ontology could be explored to identify evidence of this type of reasoning in students work. ...
Article
This article provides a review of the state of the art of technologies in providing automated feedback toopen-ended student work on complex problems. It includes a description of the nature of complex problems and elements of effective feedback in the context of engineering education. Existing technologies based on traditional machine learning methods and deep learning methods are compared in light of the cognitive skills, transfer skills and student performance expected in a complex problemsolving setting. Areas of interest for future research are identified.
... We must therefore confirm keywords that are a combination of words. In the UDPipe package, we can identify keywords in the text by following three methods: rapid automatic keyword extraction (RAKE; Rose et al., 2010), collocation ordering using pointwise mutual information (PMI; Church & Hanks, 1990), and parts of speech phrase sequence detection. Therefore, we used these three methods to identify keywords in the text. ...
Chapter
Strategic communication is becoming more relevant in communication sciences, though it needs to deepen its reflective practices, especially considering its potential in a VUCA world — volatile, uncertain, complex and ambiguous. The capillary, holistic and result-oriented nature that portrays this scientific field has led to the imperative of expanding knowledge about the different approaches, methodologies and impacts in all kinds of organisations when strategic communication is applied. Therefore Strategic Communication in Context: Theoretical Debates and Applied Research assembles several studies and essays by renowned authors who explore the topic from different angles, thus testing the elasticity of the concept. Moreover, this group of authors represents various schools of thought and geographies, making this book particularly rich and cross-disciplinary.
Article
We propose a method for scientific terms extraction from the texts in Russian based on weakly supervised learning. This approach doesn't require a large amount of hand-labeled data. To implement this method we collected a list of terms in a semi-automatic way and then annotated texts of scientific articles with these terms. These texts we used to train a model. Then we used predictions of this model on another part of the text collection to extend the train set. The second model was trained on both text collections: annotated with a dictionary and by a second model. Obtained results showed that giving additional data, annotated even in an automatic way, improves the quality of scientific terms extraction.
Chapter
Emerging topics, which often originate from the collaboration of two scientific subfields, can be represented by biterms (pairs of terms) where each term represents a distinct subfield. However, it is challenging to automatically find such two critical terms to represent an emerging topic exactly. First, existing term weighting models (such as TF-IDF, TextRank, RAKE, KECNW, and YAKE) may be effective for finding critical single-terms but not for critical biterms. Second, a potential biterm that may be suitable to represent the emerging topic has very low occurrences in a text (e.g., a corpus comprised of paper titles). So, even we combine two terms to generate a bag of biterms, the above term weighting models are still invalid, which will filter out these rare potential biterms. This paper proposes a novel Emerging Topic BiTerm Rank (ETBTRank) model to help automatically extract biterms for representing emerging topics, distinguishing emerging-topic biterms from unimportant biterms. In ETBTRank, we separately weigh the two terms in a biterm and find the emerging-topic biterms by a rule: if a biterm itself is rare, but each of the two terms in it has a high weight, then it is an emerging topic biterm. Experimental studies on paper title datasets demonstrate the effectiveness of the proposed model.
Chapter
Since its inception in 2004 the Archive of Formal Proofs has grown in size but its interface and functionality have only been minimally improved. To transform the AFP into a more user-friendly and effective resource, we redesigned the website to meet modern web standards and practices. We ensure that our work is community-driven by basing the redesign on results from a survey of the Isabelle community. The site generation uses Hugo and is implemented as a proper Isabelle component, which also allows us to adapt the AFP metadata model to avoid inconsistencies in the future. Notable improvements include a responsive design, new theory browsing interface, integrated search, and enhanced navigation.
Article
Full-text available
Automatic keyword extraction is the task of automatically selecting a small set of terms describing the content of a single document. That a keyword is extracted means that it is present verbatim in the document to which it is assigned. This dissertation discusses the development of an algorithm for automatic keyword extraction, and presents a number of experiments, in which the performance of the algorithm is incrementally improved. The approach taken is that of supervised machine learning, that is, prediction models are constructed from documents with known keywords. Before any learning can take place, the data must be pre-processed and represented. In the presented research, two problems concerning the representation for keyword extraction are tackled. Since a keyword may consist of more than one token, the first problem concerns where a keyword begins and ends in a running text, that is, how a candidate term is defined. In this dissertation, three term selection approaches are defined and evaluated. The first approach extracts all uni-, bi-, and trigrams, the second approach extracts all noun phrase chunks, while the third approach extracts all terms matching any of a number of empirically defined part-of-speech patterns. Since the majority of the extracted candidate terms are not keywords, the second problem concerns how these terms can be limited, to only keep those that are appropriate as keywords. In the presented research, four features for filtering the candidate terms are investigated. These are term frequency, inverse document frequency, relative position of the first occurrence, and the part-of-speech tag or tags assigned to the candidate term. The research presented in this dissertation is linguistically oriented in the sense that the output from natural language processing tools is a considerable factor both for the pre-processing of the data, as well as for the performance of the prediction models. Of the three term selection approaches, the best individual performance – as measured by keywords previously assigned by professional indexers – is achieved by the noun phrase chunk approach. The part-of-speech tag feature dramatically improves the performance of the models, independently of which term selection approach is applied. The highest performance is, however, achieved when the predictions of all three models are combined.
Conference Paper
Full-text available
This paper summarizes algorithms and analysis methodology for mining the evolving content in text streams. Text streams include news, press releases from organizations, speeches, Internet blogs, etc. These data are a fundamental source for detecting and characterizing strategic intent of individuals and organizations as well as for detecting abrupt or surprising events within communities. Specifically, an analyst may need to know if and when the topic within a text stream changes. Much of the current text feature methodology is focused on understanding and analyzing a single static collection of text documents. Corresponding analytic activities include summarizing the contents of the collection, grouping the documents based on similarity of content, and calculating concise summaries of the resulting groups. The approach reported here focuses on taking advantage of the temporal characteristics in a text stream to identify relevant features (such as change in content), and also on the analysis and algorithmic methodology to communicate these characteristics to a user. We present a variety of algorithms for detecting essential features within a text stream. A critical finding is that the characteristics used to identify features in a text stream are uncorrelated with the characteristics used to identify features in a static document collection. Our approach for communicating the information back to the user is to identify feature (word/phrase) groups. These resulting algorithms form the basis of developing software tools for a user to analyze and understand the content of text streams. We present analysis using both news information and abstracts from technical articles, and show how these algorithms provide understanding of the contents of these text streams.
Article
Text streams, collections of documents or messages that are generated and observed over time, are ubiquitous. Our research and development is targeted at developing algorithms to find and characterize changes in topic within text streams. To date, this research has demonstrated the ability to detect and describe (1) short-duration, atypical events and (2) the emergence of longer-term shifts in topical content. This technology has been applied to pre-defined, temporally ordered document collections but is also suitable for application to near-real-time textual data streams. The underlying event and emergence detection algorithms have been interfaced to an event detection software user interface named SURPRISE. This software provides an interactive graphical user interface and tools for manipulating and correlating the terms and scores identified by the algorithms. Additionally, SURPRISE has been interfaced with the IN-SPIRE text analytics tool to enable an analyst to evaluate the surprising or emerging terms via a visualization of the entire document collection. IN-SPIRE assists in the exploration of related topics, events, and views, currently based on single-term events. The focus of this research is to contribute to detecting, and preventing, strategic surprise.
Article
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, with experiments on three test collections showing in particular that frequently occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
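A minimal sketch of that weighting idea, assuming the familiar logarithmic form of collection-frequency (inverse document frequency) weighting rather than the paper's exact formulation:

    import math

    def idf_weight(term, doc_freq, n_docs):
        # Rare (more specific) terms receive a higher weight.
        df = doc_freq.get(term, 0)
        return math.log(n_docs / df) if df else 0.0

    def match_score(query_terms, doc_terms, doc_freq, n_docs):
        # Matches on infrequent terms contribute more than matches on
        # frequent terms when scoring a document against a request.
        return sum(idf_weight(term, doc_freq, n_docs)
                   for term in query_terms if term in doc_terms)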
Article
A stop list, or negative dictionary, is a device used in automatic indexing to filter out words that would make poor index terms. Traditionally, stop lists are supposed to have included only the most frequently occurring words. In practice, however, stop lists have tended to include infrequently occurring words, and have not included many frequently occurring words. Infrequently occurring words seem to have been included because stop list compilers have not, for whatever reason, consulted empirical studies of word frequencies. Frequently occurring words seem to have been left out for the same reason, and also because many of them might still be important as index terms. This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite-state-machine-based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.
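The mechanics of such a construction might be sketched as follows; the count threshold and the manual cull/add lists here are placeholders for the editorial judgments described above, not the paper's actual word lists:

    from collections import Counter

    def build_stop_list(corpus_tokens, min_count=300,
                        keep_as_index_terms=(), always_stop=()):
        # Start from tokens above a frequency threshold, cull words judged
        # too important as index terms, then add words expected to be very
        # frequent in particular kinds of literature.
        counts = Counter(token.lower() for token in corpus_tokens)
        frequent = {word for word, count in counts.items() if count > min_count}
        return sorted((frequent - set(keep_as_index_terms)) | set(always_stop))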
Article
This article describes an evaluation of the Kea automatic keyphrase extraction algorithm. Document keyphrases are conventionally used as concise descriptors of document content, and are increasingly used in novel ways, including document clustering, searching and browsing interfaces, and retrieval engines. However, it is costly and time-consuming to manually assign keyphrases to documents, motivating the development of tools that automatically perform this function. Previous studies have evaluated Kea's performance by measuring its ability to identify author keywords and keyphrases, but this methodology has a number of well-known limitations. The results presented in this article are based on evaluations by human assessors of the quality and appropriateness of Kea keyphrases. The results indicate that, in general, Kea produces keyphrases that are rated positively by human assessors. However, typical Kea settings can degrade performance, particularly those relating to keyphrase length and domain specificity. We found that for some settings, Kea's performance is better than that of similar systems, and that Kea's ranking of extracted keyphrases is effective. We also determined that author-specified keyphrases appear to exhibit an inherent ranking, and that they are rated highly and therefore suitable for use in training and evaluation of automatic keyphrasing systems.
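For contrast with the human-assessor evaluation reported in the article, the conventional author-keyword methodology it improves on can be summarized by a small, hypothetical helper that counts exact matches against author-assigned keyphrases:

    def exact_match_scores(extracted, author_assigned):
        # Precision and recall of extracted keyphrases against the
        # author-assigned set, using case-insensitive exact matching.
        extracted_set = {phrase.lower().strip() for phrase in extracted}
        gold_set = {phrase.lower().strip() for phrase in author_assigned}
        hits = len(extracted_set & gold_set)
        precision = hits / len(extracted_set) if extracted_set else 0.0
        recall = hits / len(gold_set) if gold_set else 0.0
        return precision, recall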
Article
In a document retrieval or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
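One way to make the density notion concrete is to average pairwise cosine similarities over the indexed documents, as in the sketch below; this formulation is an assumption for illustration, and the paper's own density computation may differ in detail:

    import math
    from itertools import combinations

    def cosine(a, b):
        # Cosine similarity of two sparse term-weight vectors (dicts).
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def space_density(doc_vectors):
        # Average pairwise similarity of the indexed documents; a lower value
        # means the documents lie farther apart in the indexing space.
        pairs = list(combinations(doc_vectors, 2))
        return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0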