Applying Key Phrase Extraction to aid Invalidity Search
Manisha Verma
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad, India
manisha.verma@research.iiit.ac.in
Vasudeva Varma
Language Technologies Research Centre
International Institute of Information Technology
Hyderabad, India
vv@iiit.ac.in
ABSTRACT
Invalidity search poses different challenges when compared
to conventional Information Retrieval problems. Presently,
the success of invalidity search relies on the queries created
from a patent application by the patent examiner. Since a
lot of time is spent in constructing relevant queries, auto-
matically creating them from a patent would save the ex-
aminer a lot of effort. In this paper, we address the problem
of automatically creating queries from an input patent. An
optimal query can be formed by extracting important key-
words or phrases from a patent by using Key Phrase Extrac-
tion (KPE) techniques. Several KPE algorithms have been
proposed in the literature but their performance on query
construction for patents has not yet been explored. We sys-
tematically evaluate and analyze the performance of queries
created by using state-of-the-art KPE techniques for the invalid-
ity search task. Our experiments show that queries formed
by KPE approaches perform better than those formed by
selecting phrases based on tf or tf-idf scores.
Categories and Subject Descriptors
Verticals and specialized search [Miscellaneous]:
Keywords
Patent Retrieval, Keyphrase Extraction
1. INTRODUCTION
Patents give exclusive rights to the inventor for using and
protecting his intellectual property. For a patent to be
granted, the invention has to be novel, non-obvious and use-
ful. Since a lot of patents are available in digital form on the
web and the number of patents filed and granted each
year is increasing rapidly, patent examiners today use infor-
mation retrieval tools to accomplish several search tasks.
A patent engineer routinely performs search tasks like prior
art search, patentability search, novelty search and inva-
lidity search. The objective of invalidity search is to find
patents or other related resources which cover the proposed
product or process and are still in force. The search result
consists of a report of all such inventions. The search pro-
cess is time consuming as several patents have to be read
to ensure no relevant patent is missed. The examiner starts
with a document and manually creates suitable queries to
search patent databases. Since a lot of time is spent in con-
structing relevant queries, transforming the document into a
query automatically would save the examiner a lot of effort.
Hence, one should be able to input a document as a query
instead of making several queries. Query formulation is still
a manual process and automating it would require the right
combination of query formation, refinement and expansion
techniques. An important aspect about invalidity search is
the number of relevant documents for a given patent. Since
very few patents may infringe an invention, the number of
relevant documents is usually small. Hence, it is not only
important to construct a query which covers the scope of
the invention but also retrieves all the relevant documents.
In this paper we explore state-of-the-art supervised and un-
supervised Key Phrase Extraction (KPE) techniques to cre-
ate queries from an input patent. Supervised KPE ap-
proaches need annotated data but publicly available patent
data is not annotated and manual annotation would require
domain expertise. Thus, we use a corpus based approach
to automatically label key phrases in patents with relevance
judgments to create training data for supervised KPE algo-
rithms. We use the NTCIR 6^1 collection of 1.3 million patents
and 1000 query patents to conduct all the experiments.
In Section 2, we discuss the current state of the art in patent
retrieval and key phrase extraction. Section 3 discusses the
motivation and contributions of our approach. The unsupervised
and supervised key phrase extraction techniques used in the ex-
periments are explained in Section 4, along with the approach
used to annotate patents with phrases. The experimental setup
is described in Section 5 and the results are analyzed in Section 6.
Conclusions and future work are discussed in
Section 7 and Section 8 respectively.
2. RELATED WORK
Prior-art retrieval and invalidity search have received con-
siderable attention from the research community recently.
Several workshops by NTCIR^2 and CLEF^3 have been con-
^1 http://research.nii.ac.jp/ntcir/index-en.html
^2 http://research.nii.ac.jp/ntcir/publication1-en.html
^3 http://www.ir-facility.org/the irf/clef-ip09-track
ducted to evaluate and improve the state of the art in patent
retrieval. Patent retrieval poses a unique challenge as the
language of patents is not only vague but also contains a
lot of new terms and concepts introduced by the inventor.
This results in a lot of content that discusses similar aspects
but uses different vocabulary, which makes searching for simi-
lar patents a daunting task. Patents are lengthy but well-
structured documents and contain a title, abstract, descrip-
tion, summary of invention and claims. The claims define
the scope of protection granted by the patent. All patents
have manually assigned International Patent Classification
(IPC) codes.
Several approaches have been proposed to improve patent
retrieval. Some systems use entire claim text as a query
[11] or use information in the patent text to form queries
[5, 13, 14] and modify existing retrieval models to improve
performance [3, 7, 8].
Other approaches identify strong keywords for query con-
struction from the input patent and then expand these queries
using relevance feedback. Bashir et al. [2] analyze the bias
of several retrieval systems and query expansion techniques.
They propose a query expansion technique based on cluster-
ing to identify dominant relevant documents. An extension
of the above work is proposed in [1] where several SMART
similarity metrics are used to select better terms for query
expansion. However, they construct queries of only 2, 3 and
4 words, which may not be the case in the real world. Moreover,
for each patent added to the database these queries will have
to be reconstructed to measure the retrievability of the doc-
ument. Morphological analysis has also been used in [9] to
extract words from the claim for a query. These words are
used to find related terms from ‘detailed description of the
invention’ and the related terms are used for query expan-
sion.
Xue et al. [18, 19] explore ways to create a query from a
patent. They propose a generalized algorithm for extracting
query words from a patent. They evaluate queries formed
by words extracted from different sections in a patent. They
empirically determine how many query words should be kept
in the query. Different weighting methods are also used to
weigh words in the query.
Retrieval models proposed in the literature will not perform
well if the query is constructed with weak key phrases. Thus
the selection of the right phrases becomes an important step. In
[10] candidate n-grams are selected using a classifier. The
authors manually annotate potential keywords to train the
classifier. Extension of their approach to patents from sev-
eral areas would again require domain expertise. Since this
is a time consuming and expensive task, it is infeasible to
annotate large volumes of patents in the absence of an ex-
pert.
3. MOTIVATION AND CONTRIBUTION
Initial experiments indicated that selecting phrases on the basis
of frequency counts (tf or tf-idf) resulted in poor queries.
Though several key phrase extraction approaches have been
proposed in the literature, they have not been used to create
queries for the invalidity search task. Our contributions are:
1. We systematically evaluate and analyze the perfor-
mance of queries created by using state-of-the-art un-
supervised key phrase extraction techniques.
2. We propose a corpus based mechanism to annotate
query keywords in patents for which related patents
are known. These patents are used to train supervised
key phrase extraction techniques.
4. KEY PHRASE EXTRACTION
The aim of key phrase extraction (KPE) algorithms is to
identify phrases or words that represent important units
of information in a document. Key phrases are used in
several applications like document categorization, clustering
and summarization. A KPE algorithm takes a document
as input and outputs phrases or words that represent the
document.
A list of phrases, generated by a KPE algorithm, could
succinctly represent a complex and lengthy patent. These
phrases could then be used to form queries to search for
similar patents. Informative phrases will be able to retrieve
relevant patents whereas the results for weak phrases will
be noisy and irrelevant. Thus, phrases extracted by KPE
techniques could be used to search for similar patents.
A KPE technique is either supervised or unsupervised. Unsuper-
vised approaches use co-occurrence statistics or frequency
counts to extract and score candidate phrases from a doc-
ument. Unsupervised approaches do not need any labeled
data. For certain corpora (e.g. research articles), key phrases
annotated by the experts are available. Supervised approaches
are trained on this data to extract phrases from new doc-
uments. We label some patents with key phrases to create
training data for supervised approaches. For all approaches
the top phrases are used to form a query to find similar
patents and each phrase in the query has the same weight.
The performance of queries formed by the supervised ap-
proaches is compared with those formed by the unsuper-
vised approaches by using the relevance judgments of the query
patent. We explore two state-of-the-art supervised and two un-
supervised approaches to extract phrases from
a patent. The approaches are briefly explained in the fol-
lowing subsections.
4.1 Unsupervised KPE Approaches
We use two approaches - TextRank [12] and SingleRank [16].
These algorithms use co-occurrence statistics to score words
and identify phrases from a document. These approaches use
the information around a word to calculate its importance
whereas tf or tf-idf scores do not reflect this information.
4.1.1 TextRank
TextRank algorithm represents a document as a graph. Each
vertex in the graph corresponds to a word. There is an edge
between any two words occurring together. A weight, w_ij,
is assigned to the edge connecting two vertices, v_i and v_j,
and its value is the number of times the corresponding words
co-occur within a window of W words in the document. The
score of a vertex reflects its importance. The score for v_i,
S(v_i), is initialized with a default value and is computed in
an iterative manner until convergence using this recursive
formula:
$$ S(v_i) = (1-d) + d \times \sum_{v_j \in Adj(v_i)} \frac{w_{ji}}{\sum_{v_k \in Adj(v_j)} w_{jk}} \, S(v_j) \qquad (1) $$
where Adj(v_i) denotes v_i's neighbors in the graph and d is
the damping factor. Intuitively, a word will receive a high
score if it has many high-scored neighbors. After conver-
gence, the top T% of vertices (words) by score are selected as
keywords. Adjacent keywords are then collapsed and output
as key phrases.
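For concreteness, the following is a minimal sketch of the scoring iteration in Equation (1), assuming the co-occurrence graph is stored as nested dictionaries; the function name, initialization value and convergence threshold are illustrative choices of ours, not details of the original TextRank implementation.

```python
def textrank_scores(graph, d=0.85, tol=1e-6, max_iter=100):
    """Iterate Equation (1) over a weighted, undirected co-occurrence graph.

    graph: dict mapping each word to a dict {neighbor: co-occurrence weight}.
    Returns a dict mapping each word to its converged score.
    """
    scores = {w: 1.0 for w in graph}  # default initial score
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for wi, nbrs in graph.items():
            # Sum the weight-normalized scores flowing in from each neighbor.
            rank_sum = sum(weight / out_weight[wj] * scores[wj]
                           for wj, weight in nbrs.items() if out_weight[wj] > 0)
            new_scores[wi] = (1 - d) + d * rank_sum
        converged = max(abs(new_scores[w] - scores[w]) for w in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores
```

The top T% of words by score would then be kept, and adjacent keywords collapsed into key phrases as described above.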
4.1.2 SingleRank
SingleRank follows the same approach as TextRank but dif-
fers in the following aspect. While in TextRank phrases
containing the top-ranked words are selected, in SingleRank,
we do not filter out any low scoring words. Each candidate
key phrase, which can be any longest-matching sequence of
nouns and adjectives, is given a score by summing the scores
of its constituent words obtained from the graph. The top N
highest-scored phrases are output as key phrases.
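A sketch of this phrase-scoring step is shown below, assuming the document has already been POS-tagged and the word scores come from the same co-occurrence graph iteration as above; the Penn-Treebank-style tag prefixes used to detect noun/adjective runs are our assumption.

```python
def singlerank_phrases(tagged_tokens, word_scores, top_n=30):
    """Score candidate phrases by summing the graph scores of their words.

    tagged_tokens: list of (token, pos_tag) pairs for the document.
    word_scores:   dict mapping a word to its score from the co-occurrence graph.
    Returns the top_n highest-scored candidate phrases.
    """
    candidates, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith(("NN", "JJ")):   # extend the current noun/adjective run
            current.append(token.lower())
        elif current:                      # run ended: store it as a candidate
            candidates.append(tuple(current))
            current = []
    if current:
        candidates.append(tuple(current))
    scored = {p: sum(word_scores.get(w, 0.0) for w in p) for p in set(candidates)}
    return [" ".join(p) for p in sorted(scored, key=scored.get, reverse=True)[:top_n]]
```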
4.2 Supervised KPE Approaches
For some corpora, key phrases annotated by experts are
available. Supervised approaches are trained on this data
and the model is used to extract phrases from new docu-
ments. We use two approaches - KEA and RankPhrase to
form queries for input patents.
4.2.1 KEA
KEA [17] is a popular supervised approach for key phrase
extraction. KEA makes a list of candidate phrases by ex-
tracting n-grams of a predefined length (e.g. 1 to 3 words)
that do not start or end with a stopword. It calculates fea-
ture values for each candidate phrase, and uses a naive Bayes
classifier to predict important phrases. A model is trained
using documents with known key phrases, and then used to
find phrases in new documents. We use the existing imple-
mentation of KEA^4 and train it on phrases generated by our
annotation approach. The features used to represent a phrase
are given below:
1. Tf-Idf: It assigns a score to each word w in a document
d based on its frequency tf_w (term frequency) in d and
on the remaining documents in the corpus (inverse
document frequency, idf). It is given by
$$ \text{tf-idf} = tf_w \times \log\left(\frac{N}{D_w}\right) $$
where N is the number of documents in the corpus and
D_w is the number of documents in which w occurs.
2. First occurrence : It is computed as the percentage
of the document preceding the first occurrence of the
term in the document. Terms that tend to appear at
the start or at the end of a document are more likely
to be key phrases.
3. Length of a phrase : It is the number of words in a
phrase.
^4 http://www.nzdl.org/Kea/
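The three features can be computed directly from token counts; the sketch below is a simplified stand-in for what the KEA toolkit does internally, with a helper name of our own and a pre-computed document-frequency table assumed.

```python
import math

def kea_features(phrase, doc_tokens, doc_freq, num_docs):
    """Compute tf-idf, first occurrence and length for a candidate phrase.

    phrase:     candidate key phrase (space-separated words).
    doc_tokens: the document as a list of lower-cased tokens.
    doc_freq:   dict mapping a phrase to the number of corpus documents containing it.
    num_docs:   total number of documents N in the corpus.
    """
    words = phrase.lower().split()
    positions = [i for i in range(len(doc_tokens) - len(words) + 1)
                 if doc_tokens[i:i + len(words)] == words]
    tf = len(positions)
    idf = math.log(num_docs / max(doc_freq.get(phrase, 1), 1))
    tf_idf = tf * idf
    # Fraction of the document that precedes the first occurrence of the phrase.
    first_occurrence = (positions[0] / len(doc_tokens)) if positions else 1.0
    return tf_idf, first_occurrence, len(words)
```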
4.2.2 RankPhrase
Recently it has been proposed that key phrase extraction is a
problem of ranking and not that of classification [6]. Instead
of using a classifier, they use a Learning-To-Rank approach
to train a ranking model on phrases and the model ranks
phrases from a new document. A ranked list of phrases is
used for training a model. This model then predicts the
order for phrases of a new patent. We use the features pro-
posed in KEA to represent phrases in the document.
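Since the ranking model is trained with SVMRank, the annotated phrases have to be serialized in that tool's standard SVM-light style input format, with one query id per training patent. The helper below is only an illustrative writer for that generic format, reusing the KEA features listed above; the label scheme is an assumption.

```python
def write_svmrank_file(path, training_patents):
    """Write phrase features in SVMrank's 'label qid:q 1:v1 2:v2 ...' line format.

    training_patents: list of (query_id, phrase_rows), where each phrase row is
                      (rank_label, tf_idf, first_occurrence, length).
    """
    with open(path, "w") as out:
        for qid, phrase_rows in training_patents:
            for rank_label, tf_idf, first_occ, length in phrase_rows:
                out.write(f"{rank_label} qid:{qid} "
                          f"1:{tf_idf:.4f} 2:{first_occ:.4f} 3:{length}\n")
```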
4.2.3 Annotating Data
To use supervised KPE approaches, patents will have to be
manually annotated for key phrases. This is a difficult task
since (1) manual annotation would require domain expertise
and, given the number of fields in which patents are written, it
is infeasible to annotate patents for each domain, and (2) it
is laborious, expensive, time consuming and prone to inter-
expert labeling variability. Thus, we use a corpus based
approach to annotate important key phrases in a patent.
NTCIR 6 dataset provides a list of relevant documents (rel-
evance judgments) for some query patents. We use these
query patents to create training data for supervised ap-
proaches. We consider those phrases of the query patent to
be important, which when treated as queries, can retrieve
related documents from the corpus.
To identify candidate phrases, we use the relevance judg-
ments of the query patent and a search engine. The process
of labeling phrases is as follows: A chunker is used to ex-
tract all the noun phrases from a patent and each phrase is
stemmed using a stemmer. Stop words are removed from
the phrases using a pre defined list. Each phrase in the re-
sultant list is fired as a query in the search engine. If the
phrase is able to retrieve documents that are relevant to the
query patent, it is informative, i.e., it captures some impor-
tant information about the query patent, and hence it can be a
candidate key phrase for that patent. Thus, each phrase is
treated as a query and its Mean Average Precision (MAP)
and Recall are calculated using the relevance judgments of
the input patent. The phrases with MAP and Recall greater
than zero are considered to be informative. To remove noisy
chunks (symbols, abbreviations, etc.) from the list, we con-
sider chunks of length greater than θ letters. We select
phrases with high MAP and Recall values and then rank
them based on tf-idf scores. Top phrases are used to rep-
resent the query patent. The algorithm is summarized in
Algorithm 1.
The process described above may select phrases which may
not be informative but still retrieve documents relevant to
the query patent. Our objective is not only to identify the
important information but also to identify those words which
will help in retrieving relevant documents. For example,
an expert will select phrases specific to the invention, but these
phrases may or may not form good queries. While creating
queries manually, one might miss terms which are not
specific to the invention but will help in retrieving similar
patents. Such terms will get selected based on MAP and
Recall values. Ranking these terms based on tf-idf scores
will ensure that they are also informative.
Algorithm 1 Algorithm to annotate patents for key phrases
Input: chunked noun-phrase list CL, key-phrase list KPL = [ ], noun phrase NP
for all NP_i in CL do
    Remove stop words
    if len(NP_i) > θ then
        add NP_i to tempList
    end if
end for
for all NP_i in tempList do
    Search NP_i in the corpus D
    Retrieve relevant documents
    if MAP(NP_i) > 0 and Recall(NP_i) > 0 then
        add NP_i to KPL
    end if
end for
Sort KPL in descending order of MAP(NP_i)
Sort KPL in descending order of tf-idf
Final key phrases ← top 30 phrases in KPL
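A compact sketch of Algorithm 1 is given below; the `search` and `tf_idf` callables stand in for the search engine index and the scoring used in the paper, and are hypothetical names introduced only for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of a single ranked result list."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0

def annotate_patent(noun_phrases, relevant_ids, search, tf_idf, theta=4, top_k=30):
    """Label key phrases for one query patent, following Algorithm 1.

    noun_phrases: stemmed, stop-word-free noun-phrase chunks from the patent.
    relevant_ids: set of document ids judged relevant to the query patent.
    search:       callable mapping a phrase to a ranked list of document ids.
    tf_idf:       callable mapping a phrase to its tf-idf score in the patent.
    """
    candidates = []
    for phrase in noun_phrases:
        if len(phrase) <= theta:            # drop noisy short chunks
            continue
        ranked = search(phrase)
        ap = average_precision(ranked, relevant_ids)
        recall = (len(set(ranked) & relevant_ids) / len(relevant_ids)
                  if relevant_ids else 0.0)
        if ap > 0 and recall > 0:           # the phrase retrieves relevant documents
            candidates.append(phrase)
    # Rank the informative candidates by tf-idf and keep the top ones.
    return sorted(candidates, key=tf_idf, reverse=True)[:top_k]
```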
Table 1: Distribution of 1000 Queries
Domain         #pat   Domain               #pat
Transport        31   Electronics           262
Chemistry        45   Engine/Pumps           15
Textile           2   Separating/Mixing       6
Instruments     543   Agriculture            75
Mining           20
The apparent advantage of our approach is its simplicity.
With the help of a search engine and available relevance
judgments for patents, one can create a list of suitable key
phrases in no time. These phrases can be used to train mod-
els for supervised KPE algorithms, to suggest keywords to
a user who is searching for similar patents etc. Another ad-
vantage is the ability of the approach to cover patents in
several domains. One might argue that a patent may not
contain any phrase with non-zero MAP and Recall. However, dur-
ing the experiments it was observed that, since patents
are lengthy documents with a large vocabu-
lary, each patent yields at least some phrases
which have a MAP or Recall value greater than zero.
5. EVALUATION
5.1 Corpus
We test the performance of each algorithm on the NTCIR
6 dataset of 1.3 million USPTO patents from 1993 to 2000.
We take 1000 sample patent applications as queries. The
list of relevant documents for each of these applications is
provided with the dataset. The average number of phrases
per patent is 1001. The patents contain four main sections -
title, abstract, claim and description which is further divided
into summary and brief description. The division of query
patents according to their IPC codes is given in Table 1.
5.2 Experimental Setup
Lucene^5 has been used to index the data and the Snowball An-
alyzer^6 with a manually determined list of 146 stop words
^5 http://lucene.apache.org/java/docs/index.html
^6 http://snowball.tartarus.org/
Figure 1: Effect of number of phrases on Recall (R@100 vs. number of phrases, 10 to 60, for tf-all, tfIdf-all, tf-desc and tfIdf-desc).
Figure 2: Effect of T% (TextRank) on Recall (R@100 and R@200 vs. T, % of nodes).
has been used to analyze the corpus and queries. The data
was indexed with four fields - title, abstract, description
and claim. OpenNLP^7 has been used to POS tag and chunk
patents. Vector space model has been used to retrieve doc-
uments. For every query patent, key phrases have been ex-
tracted using the approaches described in Section 4. To
determine the number of phrases in a query, we formulate
queries with 10 to 60 phrases. Phrases were selected from the
whole patent (all) and the description (desc) based on tf and
tf-idf. R@100 for these queries is shown in Figure 1. With more
phrases, the change in performance is not significant, but
the time spent on search increases considerably. Since
there is very little difference between the Recall of queries
with 40 and 60 phrases, we limit the number of phrases in
the query to 40. The queries are formed by se-
lecting phrases based on tf-idf from the description section of
the patent. The top 40 phrases are used to form a boolean ‘OR’
query to search for similar patents. Note that all the phrases
in the query have the same weight.
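The query construction itself reduces to joining the selected phrases into one boolean query. The sketch below simply assembles a Lucene-style query string from quoted phrases with equal (default) weights; it is not the authors' actual indexing code, and the example phrases are hypothetical.

```python
def build_or_query(phrases, limit=40):
    """Combine the top-ranked phrases into a single boolean 'OR' query string.

    Each phrase is quoted so that it is matched as a phrase; no per-phrase
    boost is applied, so all phrases carry the same weight.
    """
    return " OR ".join(f'"{p}"' for p in phrases[:limit])

# Example (hypothetical phrases):
# build_or_query(["optical fiber", "cladding layer", "refractive index"])
# -> '"optical fiber" OR "cladding layer" OR "refractive index"'
```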
We use publicly available implementations of TextRank and
SingleRank [4]. The damping factor is set to 0.85. In TextRank,
^7 http://opennlp.sourceforge.net/
Figure 3: Effect of N (SingleRank) on Recall (R@100 and R@200 vs. number of phrases, 5 to 40).
the number of vertices used to create phrases (T%)
was varied from 0.05% to 30%. Experiments show that bet-
ter phrases are generated when top 0.15% nodes from the
graph are collapsed to form phrases. In SingleRank, we vary
N from 5 to 40 phrases. It was found that N=30 resulted
in better phrases. R@100 and R@200 of queries formed by
varying N in SingleRank and T% in TextRank are shown
in Figure 3 and Figure 2 respectively. We tested both ap-
proaches on 1000 query patents and used window sizes of 2, 3,
4 and 5 words to create the co-occurrence graph. TextRank
and SingleRank have been tested on individual sections (ab-
stract, description and claims) in the patent also. Since
individual sections contain less vocabulary, small values of
T result in very few phrases. To prevent this, the value of T
was increased to 50%, 30% and 30% for abstract, description
and claims respectively.
Query patents, with relevance judgments, are used to create
training data for supervised approaches. The value of θ is
set to 4 for testing the algorithm. The supervised KPE ap-
proaches have been trained and tested on these queries. We
use the publicly available implementation of KEA. SVMRank^8
has been used to train and test the model to rank phrases
of a query patent. For supervised approaches we use 5-fold
cross validation. KEA and RankPhrase have been trained
on 5 subsets of queries; each has 400 patents for training and
200 for testing. Since invalidity search is a recall-oriented
task and a patent examiner usually scans the top 200 documents
[15], we report Recall values at 10, 30, 100 and 200.
The performance of queries formed by selecting phrases from the
description section (following [18]) on the basis of the frequency of a
phrase in the patent (tf), tf-idf and idf is the baseline for our
work. We perform a paired t-test to assess statistical
significance.
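The significance test can be run over per-query scores of two systems; the sketch below uses SciPy's paired t-test, with the score arrays and significance level as illustrative inputs.

```python
from scipy import stats

def paired_significance(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-query scores (e.g., R@100) of two systems."""
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return p_value < alpha, p_value
```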
6. RESULTS AND DISCUSSION
The results for queries formed by phrases based on tf, idf
and tf-idf are shown in Table 2. Unsupervised approaches
have been tested with window size of 2, 3, 4 and 5 words to
create the co-occurrence graph. The performance of queries
^8 http://svmlight.joachims.org/
formed by selecting phrases from individual sections and from the en-
tire patent is shown in Table 3 and Table 2 respectively. Results of su-
pervised approaches trained on the annotated data are given
in Table 4.
Queries based on inverse document frequency (idf) contain
rare phrases, i.e. phrases which occur in few doc-
uments. There are two reasons why they do not perform
well. First, though two patents may claim the same in-
vention, they use different terminology to describe it, hence
the vocabulary overlap between them is minimal. For
this reason several phrases in a patent have high idf val-
ues. However, extremely low occurrence of a phrase in the corpus
does not indicate that it is important, and queries with such
phrases may not necessarily be informative. The second reason
is that idf values do not reflect the frequency with which
a phrase has been used in the input patent. For example,
a phrase may be present in a small fraction of documents
(high idf) in the corpus but occurs only once (low tf ) in the
document. Such a phrase, if used to search for similar patents,
may not retrieve relevant documents.
Queries formed by selecting phrases based on their frequency
in the document (tf) and tf-idf perform better. This is due
to the presence of document level information about the
phrase in its tf and tf-idf score. The author of a patent uses
some phrases repeatedly to describe a part or component
of the invention. Such important phrases have higher fre-
quency than others. Selecting phrases based on tf ensures
that these informative phrases are present in the queries.
Queries based on tf-idf contain phrases which have both
high tf and idf. As a result, tf-idf queries perform better
than both - tf and idf queries.
Both unsupervised approaches have been tested on 1000
query patents with window size of 2, 3, 4 and 5 words to
create the co-occurrence graph. Unsupervised approaches
yield slightly better queries than tf-idf. This is due to the
co-occurrence information used by both TextRank and Sin-
gleRank. The performance of unsupervised approaches de-
grades as the window size increases, which is intuitive, since
increasing the size weakens the semantic relation between the far-
thest words in the phrase. There is a considerable difference
in the performance of TextRank and SingleRank. The MAP
and R@100 of TextRank are higher than those of SingleRank.
This can be explained by the following: TextRank requires
that every word of a key phrase must appear among the top-
ranked unigrams. SingleRank, on the other hand, does not
require all unigrams of a key phrase to be present in the
top-ranked list of words. TextRank has a fairly strict crite-
rion, in comparison to SingleRank, which lowers
the importance of those phrases which do not contain any
top-ranked word from the graph; this in turn reduces
the noise, so better key phrases are retrieved using
TextRank. The Recall values of TextRank are also higher than
those of SingleRank and tf-idf.
The performance of TextRank and SingleRank is dependent
on the graph constructed from the patent text. Both the
approaches do not perform well when individual sections of
a patent are used to extract key phrases. This can be ex-
plained as follows: (1) individual sections represent the doc-
ument only partially, which provides an incomplete estimate of co-
Table 2: Performance of queries formed by unsupervised approaches
(rows under TextRank and SingleRank correspond to window sizes 2-5)
              MAP      R@10     R@30     R@100    R@200
tf            0.0414   0.0365   0.0860   0.1740   0.2390
idf           0.0140   0.0201   0.0325   0.0510   0.0640
tf-idf        0.0428   0.0365   0.0870   0.1781   0.2412
TextRank
  2           0.0458   0.0456   0.0969   0.1885   0.2606
  3           0.0455   0.0449   0.0969   0.1859   0.2576
  4           0.0452   0.0444   0.0958   0.1848   0.2561
  5           0.0454   0.0440   0.0960   0.1853   0.2560
SingleRank
  2           0.0340   0.0316   0.0689   0.1380   0.2010
  3           0.0336   0.0314   0.0687   0.1360   0.1925
  4           0.0333   0.0310   0.0675   0.1362   0.1930
  5           0.0332   0.0309   0.0671   0.1357   0.1910
Table 3: Performance of unsupervised approaches
on patent fields (TR: TextRank, SR: SingleRank)
MAP R@10 R@30 R@100 R@200
abst TR 0.0285 0.0245 0.0560 0.1230 0.1775
desc TR 0.0400 0.0312 0.0737 0.1555 0.2234
claim TR 0.0321 0.0274 0.0641 0.1390 0.2000
abst SR 0.0305 0.0269 0.0629 0.1325 0.1890
desc SR 0.0335 0.0321 0.0682 0.1356 0.1915
claim SR 0.0328 0.0315 0.0648 0.1335 0.1900
occurrence statistics, and (2) the entire document text pro-
vides a better estimate of edge weights in the graph, which results
in better ranking of vertices (words). Among individual sections,
longer sections will perform better than shorter sections.
This is reflected in the experimental results too. The entire
patent text yields better phrases than individual sections. Among
individual sections, TextRank and SingleRank perform bet-
ter when the description is used to form the co-occurrence graph.
The ‘Description’ section of a patent contains more vocabulary
than the other sections; this results in better edge weights in the
co-occurrence graph, which in turn results in better phrases.
In supervised approaches, queries created by using phrases
extracted by KEA show 29% and 37% improvement in MAP
and 27% and 29% improvement in R@100 over TextRank
and tf-idf respectively. There is a substantial improvement
over all the other KPE approaches as well. This indicates
that queries formed by combining phrases output by KEA,
were better than those created from other approaches. Since
the queries performed well, better phrases were selected by
the KEA algorithm. Since the phrases were informative, it
can be deduced that KEA was provided reasonable training
data. Though the RankPhrase approach performs better than
the unsupervised approaches, it does not match the performance
of KEA in extracting phrases. This can be attributed to the
length of patent documents: since the documents
are lengthy, the number of candidate phrases is large, and some
important phrases are not ranked correctly by the ap-
proach, which lowers the performance of the queries because those
phrases are missed. Our experiments indicate
that queries made by using phrases from KPE techniques
certainly improve invalidity search, and the improvement
is larger when supervised approaches are used for extraction.
To assess the performance of key phrase extraction tech-
niques in creating queries, it was important that experiments
be conducted with both supervised and unsupervised KPE
Table 4: Performance of queries formed by super-
vised approaches
MAP R@10 R@30 R@100 R@200
KEA 0.059 0.054 0.121 0.230 0.315
RankPhrase 0.052 0.053 0.108 0.200 0.284
approaches. The experiments show that queries formed by
using KPE approaches can indeed improve patent retrieval.
7. CONCLUSION
Automatic construction of queries from patents would be
useful in applications like invalidity search. The current ap-
proach is to create a query from the patent by selecting the top
K keywords based on some score. In this work we evaluated
the performance of queries made by using phrases
extracted by popular key phrase extraction techniques. We
used both supervised and unsupervised key phrase extrac-
tion algorithms to extract phrases from a patent application
and form queries to search for similar patents. The perfor-
mance of these queries is compared with those formed by
selecting phrases based on tf, idf and tf-idf. The results in-
dicate that tf-idf is not a good metric to select key phrases
to form queries from input patents. Queries created by us-
ing unsupervised and supervised approaches perform better
than those formed by tf or tf-idf. To train supervised KPE
approaches, labeled data is required. Since there is no an-
notated data for candidate keywords in patents, we propose
an approach to annotate important key phrases in patents.
Supervised approaches are trained on this data. The experi-
ments indicate that key phrase extraction techniques indeed
improve invalidity search results. In supervised approaches,
queries created by using phrases extracted by KEA show
29% and 37% improvement over TextRank and tf-idf respec-
tively. Since queries generated by supervised approaches
perform better than those generated by unsupervised ap-
proaches, it can be inferred that our annotation approach is
able to label informative phrases in a patent.
8. FUTURE WORK
For future work, the queries generated by these approaches
could be expanded or weighted to improve retrieval. We shall
evaluate the performance of our annotation approach and KPE
techniques on multilingual patent datasets. In the future, we
will explore how the structure of a patent, frequency counts and
co-occurrence information can be incorporated into one key
phrase extraction algorithm to improve performance. The com-
bination of unsupervised and supervised approaches to cre-
ate queries from patents will also be explored.
9. REFERENCES
[1] S. Bashir and A. Rauber. Analyzing document
retrievability in patent retrieval settings. In
S. Bhowmick, J. Küng, and R. Wagner, editors,
Database and Expert Systems Applications,volume
5690 of Lecture Notes in Computer Science, pages
753–760. Springer Berlin / Heidelberg, 2009.
[2] S. Bashir and A. Rauber. Improving retrievability of
patents with cluster-based pseudo-relevance feedback
documents selection. In CIKM ’09: Proceeding of the
18th ACM conference on Information and knowledge
management, pages 1863–1866, New York, NY, USA,
2009. ACM.
[3] A. Fujii. Enhancing patent retrieval by citation
analysis. In SIGIR ’07: Proceedings of the 30th annual
international ACM SIGIR conference on Research and
development in information retrieval, pages 793–794,
New York, NY, USA, 2007. ACM.
[4] K. S. Hasan and V. Ng. Conundrums in unsupervised
keyphrase extraction: Making sense of the
state-of-the-art. In Proceedings of COLING 2010:
Posters Volume, pages 365–373, 2010.
[5] K. V. Indukuri, A. A. Ambekar, and A. Sureka.
Similarity analysis of patent claims using natural
language processing techniques. In ICCIMA ’07:
Proceedings of the International Conference on
Computational Intelligence and Multimedia
Applications (ICCIMA 2007), pages 169–175,
Washington, DC, USA, 2007. IEEE Computer Society.
[6] X. Jiang, Y. Hu, and H. Li. A ranking approach to
keyphrase extraction. In SIGIR ’09: Proceedings of the
32nd international ACM SIGIR conference on
Research and development in information retrieval,
pages 756–757, New York, NY, USA, 2009. ACM.
[7] I.-S. Kang, S.-H. Na, J. Kim, and J.-H. Lee.
Cluster-based patent retrieval. Inf. Process. Manage.,
43(5):1173–1182, 2007.
[8] J. Kim, I.-S. Kang, and J.-H. Lee. Cluster-based
patent retrieval using international patent
classification system. In Y. Matsumoto, R. Sproat,
K.-F. Wong, and M. Zhang, editors, Computer
Processing of Oriental Languages. Beyond the Orient:
The Research Challenges Ahead, volume 4285 of
Lecture Notes in Computer Science, pages 205–212.
Springer Berlin / Heidelberg, 2006.
[9] K. Konishi, A. Kitauchi, and T. Takaki. Invalidity
patent search system of ntt data. In Working Notes of
the Fourth NTCIR Workshop Meeting, NII, 2005.
[10] P. Lopez and L. Romary. Experiments with citation
mining and key-term extraction for Prior Art Search.
In CLEF 2010 - Conference on Multilingual and
Multimodal Information Access Evaluation, Padua,
Italy, 2010.
[11] H. Mase, T. Matsubayashi, Y. Ogawa, M. Iwayama,
and T. Oshio. Proposal of two-stage patent retrieval
method considering the claim structure. ACM
Transactions on Asian Language Information
Processing (TALIP), 4(2):190–206, 2005.
[12] R. Mihalcea and P. Tarau. TextRank: Bringing order
into texts. In Proc. of EMNLP, 2004.
[13] T. Takaki, A. Fujii, and T. Ishikawa. Associative
document retrieval by query subtopic analysis and its
application to invalidity patent search. In CIKM ’04:
Proceedings of the thirteenth ACM international
conference on Information and knowledge
management, pages 399–405, New York, NY, USA,
2004. ACM.
[14] S. Tiwana and E. Horowitz. Extracting problem
solved concepts from patent documents. In PaIR ’09:
Proceeding of the 2nd international workshop on
Patent information retrieval, pages 43–48, New York,
NY, USA, 2009. ACM.
[15] Y. H. Tseng and Y. J. Wu. A study of search tactics
for patentability search: a case study on patent
engineers. In PaIR ’08: Proceeding of the 1st ACM
workshop on Patent information retrieval, pages
33–36, New York, NY, USA, 2008. ACM.
[16] X. Wan, J. Yang, and J. Xiao. Towards an iterative
reinforcement approach for simultaneous document
summarization and keyword extraction. In Proceedings
of the 45th Annual Meeting of the Association of
Computational Linguistics, pages 552–559, Prague,
Czech Republic, June 2007. Association for
Computational Linguistics.
I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin,
and C. G. Nevill-Manning. Kea: practical automatic
keyphrase extraction. In DL ’99: Proceedings of the
fourth ACM conference on Digital libraries, pages
254–255, New York, NY, USA, 1999. ACM.
[18] X. Xue and W. B. Croft. Automatic query generation
for patent search. In CIKM ’09: Proceeding of the 18th
ACM conference on Information and knowledge
management, pages 2037–2040, New York, NY, USA,
2009. ACM.
[19] X. Xue and W. B. Croft. Transforming patents into
prior-art queries. In SIGIR ’09: Proceedings of the
32nd international ACM SIGIR conference on
Research and development in information retrieval,
pages 808–809, New York, NY, USA, 2009. ACM.