
Segmentation of patent claims for improving their readability



Gabriela Ferraro1 2, Hanna Suominen14, Jaume Nualart1 3
1NICTA / Locked Bag 8001, Canberra ACT 2601, Australia
2The Australian National University
3University of Canberra
4University of Turku
Good readability of text is important
to ensure efficiency in communication
and eliminate risks of misunderstanding.
Patent claims are an example of text whose
readability is often poor. In this paper,
we aim to improve claim readability through
a clearer presentation. Our approach first
segments the original claim text into its
components (the preamble, transition, and
body text) and then segments the components
further into clauses. An alternative approach
would have been to modify the claim content,
which is, however, prone to also changing the
meaning of this legal text. Our rule-based
method detects the beginning and end of
the preamble (transition) [body text] with
the accuracy of 100% and 97% (94% &
100%) [100% & 100%], respectively. In
clause segmentation, our conditional ran-
dom field (punctuation and keyword-based
baseline) has the precision of 77% (41%)
and recall of 76% (29%). The most com-
mon reasons for segmentation errors are
ambiguous coordinating conjunctions and
consecutive segmentation keywords. The
results give evidence for the feasibility of
automated claim and clause segmentation,
which may help not only inventors,
researchers, and other laypeople to
understand patents but also patent experts to
avoid future legal costs due to litigation.
1 Introduction
Clear language is important to ensure efficiency in
communication and eliminate risks of
misunderstanding. With written text, this clarity is
measured by readability. In recent years, we have
witnessed an increasing amount of work towards
improving text readability. In general, these efforts
focus on making general text easier to understand
for non-native speakers and people with special
needs, poor literacy, aphasia, dyslexia, or other
language deficits.
In this paper, we address making technical text
more readable to laypeople, defined as those
without professional or specialised knowledge in a
given field. Scientific papers and legal contracts
are two genres of technical text that are
difficult to understand (Alberts et al., 2011).
An extreme example that takes the
worst from both these worlds is the claim section
of patent documents: it defines the boundaries of
the legal protection of the invention by describing
complex technical issues and using specific legal
jargon (Pressman, 2006). Moreover, due to inter-
national conventions, each patent claim must be
written into a single sentence. This leads to very
long sentences with complex syntactic structures
that are hard to read and comprehend not only for
laypeople but also for technical people who are not
trained to read patent claims.
As an example of other efforts with similar
goals to improve the readability of technical text for
laypeople, we mention the CLEF eHealth shared
tasks in 2013 and 2014 (Suominen et al., 2013).
However, instead of inventors, researchers, and
other claim readers, they target patients and their
next of kin by developing and evaluating
technologies that improve the readability of clinical
reports and help them find further information
related to their condition on the Internet.
Some proposals have also been made in order to
improve claim readability, for example, by apply-
ing simplification, paraphrasing, and summarisa-
tion methods (see Section 2). However, these ap-
proaches modify the claim content. This increases
the risk of changing also the meaning, which is not
desirable in the context of patent claims and other
legal documents.
In this paper, we propose an alternative method
that focuses on clarifying the presentation of the
claim content rather than its modification. Since
readability strongly affects text comprehension
(Inui et al., 2003), the aim of this study is to make
the content of the patent claims more legible and
consequently make them easier to comprehend.
As the first steps towards this improved pre-
sentation of the patent claims, we propose to seg-
ment the original text. Our approach is data driven
and we perform the segmentation at two levels:
first, an entire claim is segmented to three com-
ponents (i.e., preamble, transition, and body text)
and second, the components are further segmented
to clauses. At the first level, we use a rule-based
method and at the second level, we apply a condi-
tional random field.
We evaluate segmentation performance statisti-
cally at both levels and in addition, we analyse er-
rors in clause segmentation qualitatively; because
our performance at the first level is almost perfect
(i.e., for detecting the beginning and end of the
preamble, the accuracy percentages are 100 and
97 and these numbers are 94 and 100 for the tran-
sition and 100 and 100 for the body text), we focus
on the errors at the second level alone. In com-
parison, we have the precision of 77 per cent and
recall of 76 per cent in clause segmentation. Even
though this performance at the second level is not
perfect, it is significantly better than the respec-
tive percentages of 41 and 29 (0.2 and 0.3) for a
baseline based on both punctuation and keywords
(punctuation only).
The rest of the paper is organised as follows:
Section 2 provides background information on
what patent claims are, how to read them, and
what kind of related work exists on claim
readability. Section 3 outlines our materials
and methods. Section 4 presents the experimental
results and discussion. Finally, conclusions and
ideas for future work are presented in Section 5.
2 Background
2.1 Patent claims
Patent documents have a predefined document
structure that consists of several sections, such as
the title, abstract, background of the invention, de-
scription of the drawings, and claims. As already
mentioned, the claims can be seen as the most im-
portant section as they define the scope of legal
protection of the invention. In most modern patent
laws, patent applications must have at least one
claim (Pressman, 2006).
[Toolholder]p, [comprising]t[a holder body with
an insert site at its forward end comprising a
bottom surface and at least one side wall where
there projects a pin from said bottom surface
upon which there is located an insert having
a central bore, a clamping wedge for wedging
engagement between a support surface of the
holder and an adjacent edge surface of said
insert and an actuating screw received in said
wedge whilst threadably engaged in a bore of
said holder, said support surface and said edge
surface are at least partially converging down-
wards said wedge clamp having distantly pro-
vided protrusions for abutment against the top
face and the edge surface of said insert, char-
acterised in that the wedge consists of a pair of
distantly provided first protrusions for abutment
against a top face of the insert, and a pair of
distantly provided second protrusions for abut-
ment against an adjacent edge surface]b.
Figure 1: An example patent claim. We have used
brackets to illustrate claim components, and the
subscripts p, t, and b correspond to the preamble,
transition, and body text, respectively.
The claims are written into a single sentence be-
cause of international conventions. Figure 1 pro-
vides an example claim.
Furthermore, a claim should be composed of, at
least, the following parts:
1. Preamble is an introduction, which describes
the class of the invention.
2. Transition is a phrase or linking word that
relates the preamble with the rest of the claim.
The expressions comprising, containing,
including, consisting of, wherein, and
characterised in that are the most common
transitions.
3. Body text describes the invention and recites
its limitations.
We have also included an illustration of these
claim components in Figure 1.
Because a claim is a single sentence, special
punctuation conventions have been developed and
are being used by patent writers. Modern claims
follow a format where the preamble is separated
from the transition by a comma, the transition
from the body text by a colon, and each invention
element in the body text by a semicolon (Radack,
1995). Other specifications regarding punctuation
are the following text elaboration and element
combination conventions:
- A claim should contain a period only at the end.
- A comma should be used in all natural pauses.
- The serial comma¹ should be used to separate
the elements of a list.
- Dashes, quotes, parentheses, and abbreviations
should be avoided.
Table 1: Per claim demographics
                      Training set   Test set
# tokens       mean             60         66
               min               7          8
               max             440        502
# boundaries   mean              5          5
               min               1          1
               max              53         41
Because a claim takes the form of a single
sentence, long sentences are common. Whereas
in general discourse (e.g., news articles)
sentences are composed of twenty to thirty words,
claim sentences with over a hundred words are
very frequent (see, e.g., Table 1 related to
materials used in this paper). As a consequence, claims
usually contain several subordinate and coordi-
nate clauses, as they enable the aforementioned
elaboration and the combination of elements of
equal importance, respectively.
As claims are difficult to read and interpret, sev-
eral books and tutorials suggest how claims should
be read (Radack, 1995; Pressman, 2006). The first
step towards reading a claim is to identify its com-
ponents (i.e., preamble, transition, and body text).
Another suggestion is to identify and highlight the
different elements of the invention spelled out in
the body text of the claims.
The clear punctuation marks and lexical
markers enable the claim component segmentation, as
suggested above. Moreover, the predominance
of intra-sentential syntactic structures (e.g., subor-
dinate and coordinate constructions) favours seg-
menting patent claims into clauses. These clauses
can then be presented as a sequence of segments
which is likely to improve claim readability.
¹ The serial comma (also known as the Oxford comma)
is the comma used immediately before a coordinating
conjunction (e.g., CDs, DVDs, and magnetic tapes, where
the last comma indicates that DVDs and magnetic tapes
are not mixed). (accessed 28 Feb, 2014)
2.2 Related work
So far, not many studies have addressed the
problem of improving the readability of patent claims.
In particular, to the best of our knowledge, there is
no research that specifically targets the problem
of presenting the claims in a more readable
layout. Consequently, we focus on efforts devoted to
claim readability in general with an emphasis on
text segmentation.
We begin by discussing three studies that ad-
dress text simplification in patent documents. Note
that these approaches modify the claim content
which may also change their meaning. This is
riskier in the context of patent documents and
other legal text than our approach of clarifying the
presentation. Moreover, in order to achieve a
reasonable performance, the methods of these studies
require sophisticated tools for discourse analysis and
syntactic parsing. Usually these tools also need to
be tailored to the genre of claim text.
First, a parsing methodology to simplify sen-
tences in US patent documents has been pro-
posed (Sheremetyeva, 2003). The resulting analy-
sis structure is a syntactic dependency tree and the
simplified sentences are generated based on the in-
termediate chunking structure of the parser. How-
ever, neither the tools used to simplify sentences
nor the resulting improvement in readability has
been formally measured.
Second, simplification of Japanese claim sen-
tences has been addressed through a rule-based
method (Shinmori et al., 2003). It identifies the
discourse structure of a claim using cue phrases
and lexico-syntactic patterns. Then it paraphrases
each discourse segment.
Third, a claim simplification method to para-
phrase and summarise text has been intro-
duced (Bouayad-Agha et al., 2009). It is multilin-
gual and consists of claim segmentation, corefer-
ence resolution, and discourse tree derivation. In
claim segmentation, a rule-based system is com-
pared to machine learning with the conclusion of
the former approach outperforming the latter. The
machine learning approach is, however, very sim-
ilar to the clause segmentation task described in
this paper. The main difference is in evaluation.
These authors use the cosine similarity to calculate
a 1:1 term overlap between the automated solution
and gold standard set whereas we assess whether
a token is an accurate segment boundary or not.
Figure 2: Example of the claim segmentation experiments
Figure 3: GATE pipeline for Task 1 (ANNIE Tokenizer,
RegEx Sentence Splitter, OpenNLP POS Tagger, Noun
Phrase Chunker, JAPE)
We continue by discussing a method that is
complementary to our approach of improving the
readability of claims through their clearer presentation
without modifying the text itself. This work by
Shinmori et al. (2012) is inspired by the fact that
claims must be understood in the light of the def-
initions provided in the description section of the
patents. It aims to enrich the content by aligning
claim phrases with relevant text from the descrip-
tion section. For the evaluation, the authors have
inspected 38 patent documents. The automated
method generates 35 alignments for these docu-
ments (i.e., twenty correct and fifteen false) and
misses only six. It would be interesting to test if
this alignment method and the claim segmentation
proposed in this paper complement each other.
We end by noting that the task of segmenting
claim phrases is similar to the task of detecting
phrase boundaries by Sang and Déjean (2001) in
the sense that the segments we want to identify are
intra-sentential. However, the peculiar syntactic
style of claims makes the phrase detection
strategies not applicable (see Ferraro (2012) for a
detailed study on the linguistic idiosyncrasy of patent
claims).
3 Materials and methods
In this paper, we performed statistical experiments
and qualitative error analyses related to two seg-
mentation tasks (see Figure 2):
1. Segmenting claims section to the components
for preamble, transition, and body text.
2. Segmenting each claim to subordinate and
coordinate clauses.
For Task 1, we developed a rule-based method
using the General Architecture for Text Engineer-
ing (GATE) (Cunningham et al., 2011). The sys-
tem had three rules, one for each of the claim parts
we were interested in identifying. The rules were
written in terms of JAPE grammars.² In order to
process the rules, the GATE pipeline illustrated in
Figure 3 was applied. Because transitions should
match with the first instance of a closed set of
keywords (we used comprise, consist, wherein,
characterize, include, have, and contain), our first rule
identified a transition and, using its boundary
indices, we restricted the application of our further
rules. This resulted in the following order of
application: transition, preamble, body text.
This first rule was applied to the complete
dataset (training, development, and test sets
merged into one single dataset) described in Table
2. Our two other rules relied on lexico-syntactic
patterns and punctuation marks. Note that even
though punctuation conventions have been
developed for claim writing (see Section 2.1),
following them is not mandatory. This led us to
experiment with these more complex rules.
² JAPE, a component of GATE, is a finite state transducer
that operates over annotations based on regular expressions.
Table 2: Dataset demographics
                   # claims   # segments   # words
Training set            811         4397     48939
Development set          10           15       260
Test set                 80          491      5517
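The transition-first strategy of Task 1 can be illustrated with a minimal Python sketch. It is not the JAPE grammar used in the experiments: the regular expression, the function name `split_claim`, and the handling of the comma and colon separators are assumptions made for illustration, built only from the keyword set listed above. A plain regular expression also over-matches (e.g., `contain\w*` would accept "container"), which is one reason a real system would rely on POS tags and noun phrase chunks.

```python
import re

# Closed set of transition keyword stems taken from the text; the regex
# suffixes are an assumption to cover inflected forms such as "comprising".
TRANSITION = re.compile(
    r"\b(?:compris\w*|consist\w*(?:\s+of)?|wherein"
    r"|characteri[sz]\w*(?:\s+in\s+that)?"
    r"|includ\w*|ha(?:ve|s|ving)|contain\w*)\b",
    re.IGNORECASE,
)


def split_claim(claim: str) -> dict:
    """Split a claim into preamble, transition, and body text."""
    # The first instance of a transition keyword anchors the other two parts.
    match = TRANSITION.search(claim)
    if match is None:
        return {"preamble": claim.strip(), "transition": "", "body": ""}
    # Everything before the transition is the preamble (minus the comma).
    preamble = claim[: match.start()].strip().rstrip(",").strip()
    # Everything after the transition is the body text (minus the colon).
    body = claim[match.end():].strip().lstrip(":").strip()
    return {"preamble": preamble, "transition": match.group(0), "body": body}


claim = ("Toolholder, comprising a holder body with an insert site at its "
         "forward end comprising a bottom surface and at least one side wall.")
parts = split_claim(claim)
print(parts["preamble"], "|", parts["transition"], "|", parts["body"])
```

Because only the first keyword match is used, the second "comprising" inside the body text of this abridged Figure 1 claim does not start a new component.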
For Task 2, our method was based on supervised
machine learning (ML). To train this ML classi-
fier, we used a patent claim corpus annotated with
clause boundaries. This corpus was provided by
the TALN Research Group from Universitat Pom-
peu Fabra. The aim of the segmentation classifier
was to decide whether a claim token is a segment
boundary or not, given a context. Thus, every to-
ken was seen as a candidate for placing a segment
boundary. Following standard ML traditions, we
split the dataset into training, development, and test
sets (Tables 1 and 2).
The corpus was analysed with a transition-based³
version of Bohnet’s parser (Bohnet and Kuhn,
2012). It was one of the best parsers in the CoNLL
Shared Task 2009 (Hajič et al., 2009).
In order to characterise the clause boundaries,
the following features were used for each token in
the corpus:
- lemma of the current token,
- part-of-speech (POS) tag⁴ of the current token
as well as POS-tags of the two immediately
preceding and two immediately subsequent words,
- syntactic head and dependent of the current
token, and
- syntactic dependency relation between the
current token and its head and dependent tokens.
³ Patent claim sentences can be very long, which
implies long-distance dependencies. Therefore,
transition-based parsers, which typically have a linear or
quadratic complexity (Nivre and Nilsson, 2004; Attardi, 2006),
are better suited for parsing patent sentences than
graph-based parsers, which usually have a cubic complexity.
⁴ The POS-tags correspond to the Penn Treebank tag set
(Marcus et al., 1993), where IN = preposition or
subordinating conjunction; CC = coordinating conjunction;
VBN = verb, past participle; VBG = verb, gerund or present
participle; WRB = wh-adverb.
Table 3: The most frequent lemmas and POS-tags
in the beginning of a segment.
Rank   Lemma          Abs. Freq.   Rel. Freq.
1      and                   675        0.137
2      wherein               554        0.112
3      for                   433        0.088
4      which                 174        0.035
5      have                  158        0.032
6      to                    155        0.031
7      characterize          152        0.031
8      a                     149        0.030
9      the                   142        0.028
10     say                   140        0.028
11     is                     64        0.013
12     that                   62        0.012
13     form                   59        0.012
14     in                     58        0.011
15     when                   56        0.011
Rank   POS-tag        Abs. Freq.   Rel. Freq.
1      IN                    739        0.150
2      CC                    686        0.139
3      VBN                   511        0.104
4      VBG                   510        0.104
5      WRB                   409        0.083
Moreover, the fifteen most frequent lemmas and
five most frequent POS-tags and punctuation
marks were used as features we called segmenta-
tion keywords (Table 3).
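The token-level features described above could be assembled along the following lines. This is a sketch under stated assumptions, not the feature extractor used in the experiments: the dictionary field names (`form`, `lemma`, `pos`, `head_lemma`, `deprel`), the `PAD` symbol, and the abbreviated keyword sets are hypothetical, and dependents of the current token are left out for brevity.

```python
from typing import Dict, List

# Illustrative subsets of the segmentation keywords in Table 3 (assumption:
# the full sets would hold fifteen lemmas, five POS-tags, and punctuation).
KEY_LEMMAS = {"and", "wherein", "for", "which", "have"}
KEY_POS = {"IN", "CC", "VBN", "VBG", "WRB"}
PUNCTUATION = {",", ";", ":"}


def token_features(sent: List[Dict[str, str]], i: int) -> Dict[str, str]:
    """Build the feature dictionary for token i of a parsed claim."""
    tok = sent[i]
    feats = {
        "lemma": tok["lemma"],
        "pos": tok["pos"],
        "head_lemma": tok["head_lemma"],  # lemma of the syntactic head
        "deprel": tok["deprel"],          # dependency relation to the head
    }
    # POS-tags of the two preceding and two following tokens, padded at edges.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        feats[f"pos[{offset:+d}]"] = sent[j]["pos"] if 0 <= j < len(sent) else "PAD"
    # Binary flag marking the token as a segmentation keyword.
    is_keyword = (tok["lemma"] in KEY_LEMMAS or tok["pos"] in KEY_POS
                  or tok["form"] in PUNCTUATION)
    feats["keyword"] = "1" if is_keyword else "0"
    return feats


sent = [
    {"form": "blade", "lemma": "blade", "pos": "NN",
     "head_lemma": "comprise", "deprel": "SBJ"},
    {"form": ",", "lemma": ",", "pos": ",",
     "head_lemma": "blade", "deprel": "P"},
    {"form": "wherein", "lemma": "wherein", "pos": "WRB",
     "head_lemma": "rotate", "deprel": "ADV"},
]
features = token_features(sent, 2)
print(features)
```

Each such dictionary corresponds to one candidate boundary position, since every token is treated as a candidate for placing a segment boundary.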
For classification we used the CRF++ toolkit,
an open source implementation of conditional ran-
dom fields (Lafferty et al., 2001). This framework
for building probabilistic graphical models to seg-
ment and label sequence data has been success-
fully applied to solve chunking (Sha and Pereira,
2003), information extraction (Smith, 2006), and
other sequential labelling problems. We compared
the results obtained by CRF++ with the following
two baselines:
- Baseline 1: each punctuation mark is a segment
boundary, and
- Baseline 2: each punctuation mark and keyword
is a segment boundary.
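The two baselines are simple enough to sketch in a few lines of Python. The punctuation set, the abbreviated keyword set, and the convention of reporting the index of the boundary token itself are assumptions made for this illustration; the text does not specify them.

```python
from typing import List, Set

PUNCTUATION = {",", ";", ":", "."}
# Illustrative subset of the segmentation keywords in Table 3 (assumption).
KEYWORDS = {"and", "wherein", "for", "which"}


def baseline1(tokens: List[str]) -> Set[int]:
    """Baseline 1: every punctuation mark is a segment boundary."""
    return {i for i, tok in enumerate(tokens) if tok in PUNCTUATION}


def baseline2(tokens: List[str]) -> Set[int]:
    """Baseline 2: every punctuation mark and keyword is a segment boundary."""
    return {i for i, tok in enumerate(tokens)
            if tok in PUNCTUATION or tok.lower() in KEYWORDS}


tokens = "said tool to be selected for the next working operation ;".split()
print(sorted(baseline1(tokens)))  # boundary indices from punctuation only
print(sorted(baseline2(tokens)))  # boundary indices from punctuation and keywords
```

Neither baseline looks at context, so both mark every occurrence of a keyword or punctuation mark regardless of its function in the claim.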
Performance in Task 1 was assessed using the
accuracy. Due to the lack of a corpus
annotated with claim components, we selected twenty
claims randomly and performed the annotation
ourselves (i.e., one of the authors annotated the
claims). The annotator was asked to assess
whether the beginning and ending of a claim com-
ponent was successfully identified.
Performance in Task 2 was evaluated using the
precision, recall, and F-score on the test set. We
considered that clause segmentation is a precision
oriented task, meaning that we emphasised the
demand for a high precision at the expense of a
possibly more modest recall.
Table 4: Evaluation of claim components
                          Correct   Incorrect
Preamble     Beginning       100%          0%
             End              97%          3%
Transition   Beginning        94%          6%
             End             100%          0%
Body text    Beginning       100%          0%
             End             100%          0%
Table 5: Evaluation of claim clauses
             Precision   Recall   F-score
Baseline 1        0.2%     0.3%      2.6%
Baseline 2         41%      29%       34%
CRF++              77%      76%       76%
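The boundary-level measures used for Task 2 can be written down as a short Python sketch. It assumes that predicted and gold boundaries are compared as sets of token indices and that the F-score is the balanced harmonic mean of precision and recall; neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
from typing import Set, Tuple


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: Set[int], gold: Set[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F-score of predicted boundary indices."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy example: two of three gold boundaries found, no false alarms.
p, r, f = evaluate({5, 10}, {5, 8, 10})
print(p, r, f)

# Under the harmonic-mean assumption, the reported precision and recall of
# CRF++ and Baseline 2 can be plugged in directly.
print(round(100 * f_score(0.77, 0.76)))
print(round(100 * f_score(0.41, 0.29)))
```

With this definition, the precision and recall reported for CRF++ (77% and 76%) and for Baseline 2 (41% and 29%) round to the F-scores of 76% and 34% listed in Table 5.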
In order to better understand errors in clause
segmentation, we analysed errors qualitatively us-
ing content analysis (Stemler, 2001). This method
is commonly used in evaluation of language tech-
nologies. Fifty segmentation errors were ran-
domly selected and manually analysed by one of
the authors.
4 Results and discussion
4.1 Statistical performance evaluation in
Tasks 1 and 2
We achieved a substantial accuracy in Task 1,
claim component segmentation (Table 4). That
is, the resulting segmentation was almost perfect.
This was not surprising since we were processing
simple and well defined types of segments. How-
ever, there was a small mismatch in the bound-
ary identification for the preamble and the transi-
tion segments.
Our ML method clearly outperformed both its
baselines in Task 2 (Table 5). It had the precision
of 77 per cent and recall of 76 per cent in clause
segmentation. The respective percentages were 41
and 29 for the baseline based on both punctuation
and keywords. If punctuation was used alone, both
the precision and recall were almost zero.
4.2 Qualitative analysis of errors in Task 2
The most common errors in clause segmentation
were due to two reasons: first, ambiguity in
coordinating conjunctions (e.g., commas as well as
and, or, and other particles) and second,
consecutive segmentation keywords.
Segmentation errors caused by ambiguous coor-
dinating conjunctions were due to the fact that not
all of them were used as segment delimiters. Let
us illustrate this with the following automatically
segmented claim fragment with two coordinating
conjunctions (a segment is a string between square
brackets, the integer sub-script indicating the seg-
ment number, and the conjunctions in italics):
... [said blade advancing member comprises a worm
rotatable by detachable handle]1 [or key]2 [and
meshing worm wheel secured to a shift]3 ...
In this example, the two conjunctions were con-
sidered as segment delimiters which resulted in
an incorrect segmentation. The correct analysis
would have been to maintain the fragment as a sin-
gle segment since simple noun phrases are not an-
notated as individual segments in our corpus.
Segmentation errors due to consecutive seg-
mentation keywords resulted in undesirable seg-
ments only once in our set of fifty cases. This hap-
pened because the classifier segmented every en-
counter with a segmentation keyword, even when
the keywords were consecutive. We illustrate this
case with the following example, which contains
two consecutive keywords, a verb in past partici-
ple (selected) and a subordinate conjunction (for).
Example (a) shows a wrong segmentation, while
example (b) shows its correct segmentation.
. . . (a) [said tool to be]1[selected]2[for the next work-
ing operation]3. . .
. . . (b) [said tool to be selected]1[for working]2...
In general, correcting both these error types
should be relatively straightforward. First, to solve
the problem of ambiguous commas, a possible so-
lution could be to constrain their application as
keywords, for example, by combining commas
with other context features. Second, segmentation
errors caused by consecutive segmentation
keywords could be solved, for example, by applying a
set of correction rules after the segmentation
algorithm (Tjong Kim Sang, 2001).
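As an illustration of what such a post-correction rule might look like, the following Python sketch merges any segment that consists of a single segmentation keyword into the preceding segment. The rule and the function name are hypothetical and are not taken from the cited work; the keyword set is reduced to the one token needed for example (a) above, whereas a fuller version would test the POS-tag (here VBN) instead of the word form.

```python
from typing import List, Set


def merge_keyword_only_segments(segments: List[List[str]],
                                keywords: Set[str]) -> List[List[str]]:
    """Merge each one-token keyword segment into the segment before it."""
    merged: List[List[str]] = []
    for seg in segments:
        if merged and len(seg) == 1 and seg[0].lower() in keywords:
            # A lone keyword is treated as the tail of the previous segment.
            merged[-1] = merged[-1] + seg
        else:
            merged.append(list(seg))
    return merged


# Wrong segmentation from example (a).
segments = [
    ["said", "tool", "to", "be"],
    ["selected"],
    ["for", "the", "next", "working", "operation"],
]
fixed = merge_keyword_only_segments(segments, {"selected"})
print(fixed)
```

Applied to example (a), the rule yields the two segments of the correct analysis in example (b), while segments that contain more than a bare keyword are left untouched.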
5 Conclusion and future work
In this paper we have presented our on-going re-
search on claim readability. We have proposed a
method that focuses on presenting the claims in
a clearer way rather than modifying their text con-
tent. This claim clarity is an important characteris-
tic for inventors, researchers, and other laypeople.
It may also be useful for patent experts, because
clear clauses may help them to avoid future legal
costs due to litigation. Moreover, better
capabilities to understand patent documents contribute to
democratisation of the invention and, therefore, to
human knowledge.
For future work, we plan to conduct a
user-centered evaluation study on claim readability. We
wish to ask laypeople and patent experts to
assess the usability and usefulness of our approach.
Furthermore, we plan to consider text highlight-
ing, terminology linking to definitions, and other
content enrichment functionalities as ways of im-
proving claim readability.
Acknowledgments
NICTA is funded by the Australian Government
through the Department of Communications and
the Australian Research Council through the ICT
Centre of Excellence Program. We also express
our gratitude to the TALN Research Group from
Universitat Pompeu Fabra for their corpus devel-
opment. Finally, we thank the anonymous review-
ers of The 3rd Workshop on Predicting and Im-
proving Text Readability for Target Reader Popu-
lations (PITR 2014), held in conjunction with the
14th Conference of the European Chapter of the
Association for Computational Linguistics (EACL
2014), for their comments and suggestions.
References
D. Alberts, C. Barcelon Yang, D. Fobare-DePonio,
K. Koubek, S. Robins, M. Rodgers, E. Simmons,
and D. DeMarco. 2011. Introduction to patent
searching. In M. Lupu, J. Tait, Mayer, and A. J.
Trippe, editors, Current Challenges in Patent
Information Retrieval, pages 3–44, Toulouse, France.
G. Attardi. 2006. Experiments with a multilanguage
non-projective dependency parser. In Proceedings
of the Tenth Conference on Computational Natural
Language Learning, CoNLL-X ’06, pages 166–170,
Stroudsburg, PA, USA. Association for
Computational Linguistics.
B. Bohnet and J. Kuhn. 2012. The best of both worlds:
a graph-based completion model for transition-based
parsers. In Proceedings of the 13th Conference of the
European Chapter of the Association for
Computational Linguistics, EACL ’12, pages 77–87,
Stroudsburg, PA, USA. Association for
Computational Linguistics.
N. Bouayad-Agha, G. Casamayor, G. Ferraro, S. Mille,
V. Vidal, and L. Wanner. 2009. Improving the
comprehension of legal documentation: the case of
patent claims. In Proceedings of the 12th
International Conference on Artificial Intelligence
and Law, ICAIL ’09, pages 78–87, New York, NY,
USA. Association for Computing Machinery.
H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan,
N. Aswani, I. Roberts, G. Gorrell, A. Funk,
A. Roberts, D. Damljanovic, T. Heitz, M. A.
Greenwood, H. Saggion, J. Petrak, Y. Li, and
W. Peters. 2011. Text Processing with GATE
(Version 6). http://, accessed 28 Feb,
G. Ferraro. 2012. Towards Deep Content Extraction:
The Case of Verbal Relations in Patent Claims.
PhD Thesis. Department of Information and
Communication Technologies, Pompeu Fabra
University.
J. Hajič, M. Ciaramita, R. Johansson, D. Kawahara,
M. A. Martí, L. Márquez, A. Meyers, J. Nivre,
S. Padó, J. Stepanek, et al. 2009. The CoNLL-2009
shared task: syntactic and semantic dependencies in
multiple languages. In Proceedings of the Thirteenth
Conference on Computational Natural Language
Learning: Shared Task, pages 1–18.
K. Inui, A. Fujita, T. Takahashi, R. Iida, and
T. Iwakura. 2003. Text simplification for reading
assistance: A project note. In Proceedings of the
2nd International Workshop on Paraphrasing:
Paraphrase Acquisition and Applications, IWP ’03,
pages 9–16.
J. D. Lafferty, A. McCallum, and F. C. N. Pereira.
2001. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data.
In Proceedings of the Eighteenth International
Conference on Machine Learning, ICML ’01, pages
282–289, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz.
1993. Building a large annotated corpus of English:
The Penn Treebank. Computational Linguistics,
19(2):313–330.
J. Nivre and J. Nilsson. 2004. Memory-based
dependency parsing. In Proceedings of the Eighth
Conference on Computational Natural Language
Learning, CoNLL ’04.
D. Pressman. 2006. Patent It Yourself. Nolo, Berkeley,
D. V. Radack. 1995. Reading and understanding patent
claims. JOM, 47(11):69–69.
E. T. K. Sang and H. Déjean. 2001. Introduction to the
CoNLL-2001 shared task: Clause identification. In
W. Daelemans and R. Zajac, editors, Proceedings of
the Fifth Conference on Computational Natural
Language Learning, volume 7 of CoNLL ’01, pages
53–57, Toulouse, France.
F. Sha and F. Pereira. 2003. Shallow parsing with
conditional random fields. In Proceedings of the 2003
Conference of the North American Chapter of the
Association for Computational Linguistics on Human
Language Technology, volume 1 of NAACL ’03,
pages 134–141, Stroudsburg, PA, USA. Association
for Computational Linguistics.
S. Sheremetyeva. 2003. Natural language analysis of
patent claims. In Proceedings of the ACL 2003
Workshop on Patent Processing, ACL ’03,
Stroudsburg, PA, USA. Association for
Computational Linguistics.
A. Shinmori, M. Okumura, Y. Marukawa, and
M. Iwayama. 2003. Patent claim processing for
readability: structure analysis and term explanation.
In Proceedings of the ACL-2003 Workshop on Patent
Corpus Processing, volume 20 of PATENT ’03, pages
56–65, Stroudsburg, PA, USA. Association for
Computational Linguistics.
A. Shinmori, M. Okumura, and Marukawa. 2012.
Aligning patent claims with the "detailed
description" for readability. Journal of Natural
Language Processing, 12(3):111–128.
A. Smith. 2006. Using Gazetteers in discriminative
information extraction. In Proceedings of the Tenth
Conference on Computational Natural Language
Learning, CoNLL ’06, pages 10–8.
S. Stemler. 2001. An overview of content analysis.
Practical Assessment, Research and Evaluation,
H. Suominen, S. Salantera, S. Velupillai, W. W.
Chapman, G. Savova, N. Elhadad, S. Pradhan, B. R.
South, D. L. Mowery, G. J. F. Jones, J. Leveling,
L. Kelly, L. Goeuriot, D. Martinez, and G. Zuccon.
2013. Overview of the ShARe/CLEF eHealth
Evaluation Lab 2013. In P. Forner, H. Müller,
R. Parades, P. Rosso, and B. Stein, editors,
Information Access Evaluation: Multilinguality,
Multimodality, and Visualization. Proceedings of the
4th International Conference of the CLEF Initiative,
volume 8138 of Lecture Notes in Computer Science,
pages 212–231, Heidelberg, Germany. Springer.
E. F. Tjong Kim Sang. 2001. Memory-based clause
identification. In Proceedings of the 2001 Workshop
on Computational Natural Language Learning -
Volume 7, ConLL ’01, Stroudsburg, PA, USA.
Association for Computational Linguistics.
We survey Natural Language Processing (NLP) approaches to summarizing, simplifying, and generating patents’ text. While solving these tasks has important practical applications – given patents’ centrality in the R&D process – patents’ idiosyncrasies open peculiar challenges to the current NLP state of the art. This survey aims at (a) describing patents’ characteristics and the questions they raise to the current NLP systems, (b) critically presenting previous work and its evolution, and (c) drawing attention to directions of research in which further work is needed. To the best of our knowledge, this is the first survey of generative approaches in the patent domain.
One of the most crucial tasks in patent analysis concerns valuating patents as intellectual properties which embody financial assets, and yield license fees or even competitive advantages. There are several approaches that deal with finding indicators of a patent's value, some in terms of patent scope. So far, the available work is incomplete in various respects. Overcoming deficits caused by an insufficient use of bibliometric indicators, this paper provides a normalized technological patent scope indicator through a semantic patent analysis of patent claims. By providing regressions between the patent scope and several indicators, this paper shows that the patent scope in the case of the three technologies DVD, HD-DVD and Blu-ray disc, follows several theories of prior work. This work tackles theoretical implications, as it manages to operationalize Knight's theory (hash mark analogy of broad and narrow patent claims) by a data driven approach and enhances recent work by means of a normalized semantic patent scope, based on patent claims instead of bibliometric data. Finally, managers may profit from a higher resolution regarding decision making for competitor analysis and M&A-questions, as this new indicator describes how broadly assignees define their technologies semantically, thereby offering a source for a patent's value.
Conference Paper
Full-text available
Clarity of language improves efficiency and reduces misunderstanding. In written text, it is measured by readability and with patent documents , this readability is known to be particularly poor in the case of layperson-users without specialized knowledge in this subject. Here we introduce a 46-question survey, founded socio-technical theories of information technology (IT) use and users, to measure linguistic complexity and its reduction by IT on a patent web-site. 65 participants have taken the survey and their responses indicate that the patent language is complex for laypeople but reducible by IT that processes long claims sections and sentences; claim words with a specific meaning; the claim dependency structure; and patent classification codes. Supplementing current patent websites with these reading aids could unlock their valuable information to the general public.
Conference Paper
Full-text available
In the paper, authors proposed a methodology to solve the problem of prior art patent search, consists of a statistical and semantic analysis of patent documents, machine translation of patent application and calculation of semantic similarity between application and patents. The paper considers different variants of statistical analysis based on LDA method. On the step of the semantic analysis, authors applied a new method for building a semantic network on the base of Meaning-Text Theory. Prior art search also needs pre-translation of the patent application using machine translation tools. On the step of semantic similarity calculation, we compare the semantic trees for application and patent claims. We developed an automated system for the patent examination task, which is designed to reduce the time that an expert spends for the prior-art search and is adopted to deal with a large amount of patent information.
Full-text available
This research brings together data analysis with software engineering and visualisation, with a specific focus on text mining and large document collections. My aim is to devise new, rich, and simple visualisation interfaces, which I call deep interfaces. With deep interfaces I introduce the idea-rich content as a product of the statistical analysis combined with human curation of labels and interpreted as a flow of subjectivity, complexity, and diversity between reader and interface and vice versa. The focus of such interfaces is not the representation of textual document collections as in Moretti’s distant reading, but to revisit traditional reading from the point of view of state of the art methods of textual analysis. Thus, the proposed interfaces can help us discover and explore text document collections by reading their contents. This is a practice-led research project that develops theoretical issues through the generation of practical artefacts. The research process is cu- mulative, following a reflexive methodology. The key outcomes of the project are embodied in an interface to a large collection of ANZAC war diaries: Diggers’ Diaries —
Full-text available
Parsing natural language is an essential step in several applications that involve document analysis, e.g. knowledge extraction, question answering, summarization, filtering. The best performing systems at the TREC Question Answering track employ parsing for analyzing sentences in order to identify the query focus, to extract relations and to disambiguate meanings of words.
Conference Paper
Full-text available
For the 11th straight year, the Conference on Computational Natural Language Learn- ing has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntac- tic and semantic dependencies in multiple lan- guages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.
This chapter introduces patent search in a way that should be accessible and useful to both researchers in information retrieval and other areas of computer science and professionals seeking to broaden their knowledge of patent search. It gives an overview of the process of patent search, including the different forms of patent search. It goes on to describe the differences among different domains of patent search (engineering, chemicals, gene sequences and so on) and the tools currently used by searchers in each domain. It concludes with an overview of open issues.
Conference Paper
Discharge summaries and other free-text reports in healthcare transfer information between working shifts and geographic locations. Patients are likely to have difficulties in understanding their content, because of their medical jargon, non-standard abbreviations, and ward-specific idioms. This paper reports on an evaluation lab with an aim to support the continuum of care by developing methods and resources that make clinical reports in English easier to understand for patients, and which helps them in finding information related to their condition. This ShARe/CLEFeHealth2013 lab offered student mentoring and shared tasks: identification and normalisation of disorders (1a and 1b) and normalisation of abbreviations and acronyms (2) in clinical reports with respect to terminology standards in healthcare as well as information retrieval (3) to address questions patients may have when reading clinical reports. The focus on patients’ information needs as opposed to the specialised information needs of physicians and other healthcare workers was the main feature of the lab distinguishing it from previous shared tasks. De-identified clinical reports for the three tasks were from US intensive care and originated from the MIMIC II database. Other text documents for Task 3 were from the Internet and originated from the Khresmoi project. Task 1 annotations originated from the ShARe annotations. For Tasks 2 and 3, new annotations, queries, and relevance assessments were created. 64, 56, and 55 people registered their interest in Tasks 1, 2, and 3, respectively. 34 unique teams (3 members per team on average) participated with 22, 17, 5, and 9 teams in Tasks 1a, 1b, 2 and 3, respectively. The teams were from Australia, China, France, India, Ireland, Republic of Korea, Spain, UK, and USA. Some teams developed and used additional annotations, but this strategy contributed to the system performance only in Task 2. 
The best systems had the F1 score of 0.75 in Task 1a; Accuracies of 0.59 and 0.72 in Tasks 1b and 2; and Precision at 10 of 0.52 in Task 3. The results demonstrate the substantial community interest and capabilities of these systems in making clinical reports easier to understand for patients. The organisers have made data and tools available for future research and development.
Patent specifications consist of patent claims and de- tailed descriptions. While patent claims are the most impor- tant part of patent specifications, they are compositionally or combinationally described and difficult to read. By align- ing patent claims with detailed description, the readability of patent claims can be improved because paraphrases for the claims can be found. In this paper, we propose a method to align patent claims with detailed descriptions by analyz- ing the structure of claims to get core elements of claims, aligning between each core element in the claim and each sentence in the detailed description, and filtering the result based on the existence of "effectiveness expressions" in the sentence.
Much work on information extraction has successfully used gazetteers to recognise uncommon entities that cannot be reliably identified from local context alone. Approaches to such tasks often involve the use of maximum entropy-style models, where gazetteers usually appear as highly informative features in the model. Although such features can improve model accuracy, they can also introduce hidden negative effects. In this paper we describe and analyse these effects and suggest ways in which they may be overcome. In particular, we show that by quarantining gazetteer features and training them in a separate model, then decoding using a logarithmic opinion pool (Smith et al., 2005), we may achieve much higher accuracy. Finally, we suggest ways in which other features with gazetteer feature-like behaviour may be identified.