Dmitriy M. Korobkin, Sergey S. Fomenkov, Alla G. Kravets, Sergey G. Kolesnikov
Volgograd State Technical University
Russia, Volgograd, Lenin av., 28,
In this paper the authors propose a methodology for prior-art patent search that consists of statistical and
semantic analysis of patent documents and calculation of the semantic similarity between an application and
patents on the basis of Subject-Action-Object (SAO) triples. The paper describes a statistical analysis based on
the LDA method and the MapReduce paradigm. At the semantic analysis step the authors apply a new method for
building a semantic representation of SAOs on the basis of Meaning-Text Theory. At the semantic similarity
calculation step we compare the SAOs from the application and patent claims. We developed software for the
patent examination task, designed to reduce the time that an expert spends on the prior-art candidate search.
This research was financially supported by the Russian Fund of Basic Research (grants No. 15-07-09142 A,
No. 15-07-06254 A, No. 16-07-00534 A).
Prior-art patent search, patent examination, LDA, semantic analysis, natural language processing, SAO, big data.
The number of patent applications increases from year to year. Around 2.9 million patent applications
were filed worldwide in 2015, up 7.8% from 2014. The growing flow of applications and the worldwide set of
more than 20 million granted patents (from 1980 to 2015) increase the time that patent examiners have to
spend examining all incoming applications. Sometimes an examiner has to make hundreds of search queries and
process thousands of existing patents manually during the examination procedure to decide whether to approve
or reject the application. The increasing workload of patent offices has created a need for new approaches to
prior-art patent retrieval based on statistical and semantic methods of natural language processing.
Many researchers have addressed the patent prior-art search task. Magdy used an approach based on unigrams
and bigrams [1]; Verma’s approach is based on keyphrase and citation extraction [2]; Mahdabi used a method
based on a time-aware random walk on a weighted network of patent citations [3]; Xue’s approach considers
an actual query as the basic unit and thus captures important query-level dependencies between words and
phrases [4]; D'hondt compared flat classification with a two-step hierarchical system that models the
IPC hierarchy [5]; Bouadjenek used queries built from the full patent description in conjunction with generic
query reduction methods [6]; Kim proposed a method for suggesting diverse queries that cover multiple aspects
of the query (patent) [7]; Ferraro's approach consists of segmenting the patent claim with a rule-based
approach, after which a conditional random field is trained to segment the components into clauses [8];
Andersson addressed three relation extraction applications: acronym extraction, hyponymy extraction and
factoid entity relation extraction [9]; Park uses SAO (subject-action-object) structures to identify patent
infringement [10]; Yufeng combines the SAO structure with the VSM module to calculate patent similarity [11];
Choi uses an SAO-based text mining approach to build a technology tree [12]; and the authors of this paper
proposed a method of technical function discovery in patent databases [13,14].
In this paper we propose a novel approach, in which we tried to combine both statistical and semantic
features to increase the accuracy of prior-art search.
The basic idea of LDA [15] is that patents are represented as random mixtures of latent topics, where each
topic is characterized by a distribution over the words of the document collection. The LDA model can be used
for statistical analysis of a patent database and for distributing patents over the unnamed topics.
The first stage of the statistical analysis is tokenization of the patent text. The tokens are individual
words or N-grams from the patent text. Tokenization is followed by lemmatization, i.e. converting the
extracted words to their base form so that the term-document matrix is built as accurately as possible. The
big data processing framework Apache Spark and its MLlib machine learning library make it possible to obtain a
dictionary of all words in the document set and to build a term-document matrix.
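The tokenization, lemmatization and matrix-building steps can be illustrated with a minimal pure-Python sketch. It is not the Spark/MLlib pipeline used in the paper: the lemma table `LEMMAS` and the function names are hypothetical stand-ins, and a real system would use a proper lemmatizer and a distributed implementation.

```python
import re
from collections import Counter

# Hypothetical toy lemma table; a real pipeline would call a lemmatizer.
LEMMAS = {"electrodes": "electrode", "connected": "connect", "bodies": "body"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(tokens):
    """Map each token to its base form when the toy table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(lemmatize(tokenize(doc))) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = ["The electrodes are connected.", "The electrode of the body."]
vocab, matrix = term_document_matrix(docs)
```

Each row of `matrix` is the bag-of-words vector of one document, which is the input form an LDA implementation expects.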
The statistical analysis software [16] performs a patent search by different methods and evaluates the
efficiency of the search. The test results of the statistical analysis lead to the following conclusions: the
most effective configuration stores the vectors in HDFS, compares the topic distribution vectors of patents by
cosine similarity, and tokenizes patents with synonym replacement. Tokenization with synonym replacement
includes the removal of stop words and, in addition, reduces all synonyms to a single word (the base
alias). This approach builds the most accurate term-document matrix and LDA model, but it is slower than
ordinary tokenization.
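A minimal sketch of the two winning ingredients, synonym replacement and cosine comparison of topic vectors, is given below. The stop-word list, the synonym table `BASE_ALIAS`, the patent identifiers and the topic vectors are all hypothetical; in the paper the topic vectors come from the LDA model.

```python
import math

STOP_WORDS = {"the", "a", "of", "and", "is"}
# Hypothetical synonym table: every synonym maps to its base alias.
BASE_ALIAS = {"apparatus": "device", "unit": "device", "wire": "cable"}


def normalize(tokens):
    """Drop stop words and replace each synonym with its base alias."""
    return [BASE_ALIAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_topics(query_topics, patent_topics):
    """Sort patents by cosine similarity of their topic vectors to the query."""
    scored = [(pid, cosine_similarity(query_topics, vec))
              for pid, vec in patent_topics.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_by_topics([0.7, 0.2, 0.1],
                        {"P1": [0.6, 0.3, 0.1], "P2": [0.1, 0.1, 0.8]})
```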
At this step we construct dependency trees for the application and patent claims [18]. The text of patent
claims has one feature that makes it difficult to use existing solutions for building dependency trees
effectively: patent claims are written as one sentence, which sometimes includes hundreds of words. To solve
this problem, an algorithm for segmenting complex sentences has been developed [19]. Sentences are segmented
on the basis of the transitional phrases of claims and special “marker” phrases such as
“, wherein”, “, said”; “, and”; “; and”; “, thereby”; “thereby”; “such that”; “so that”; “wherein”; “whereby”;
“where”; “when”; “while”; etc. To remove numbering such as “A.”, “a.”, “1.” or “1)” we used the regular
expression «^(\d{1,4}|[a-zA-Z]{1,2})(\.|\))\s»; to remove references such as «4. The device according to
claim 3...» we used the regular expression «^.+(of|in|to) claim \d+(, )?». The sentences are separated at
punctuation marks such as «:», «;», «.», «!», «?» with the regular expression «(\.|!|\?|:|;)\s?».
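The regular expressions above can be combined into a small segmentation sketch. The function name `segment_claim` is hypothetical, only a subset of the marker phrases is handled, and the punctuation pattern is written with a non-capturing group so that `re.split` does not return the delimiters; the actual algorithm of [19] may differ in all of these respects.

```python
import re

# Patterns taken from the text; the sentence-end group is made non-capturing.
NUMBERING = re.compile(r"^(\d{1,4}|[a-zA-Z]{1,2})(\.|\))\s")
CLAIM_REFERENCE = re.compile(r"^.+(of|in|to) claim \d+(, )?")
SENTENCE_END = re.compile(r"(?:\.|!|\?|:|;)\s?")
# Assumption: only a few of the listed marker phrases, as an illustration.
MARKERS = re.compile(r",?\s*\b(?:wherein|whereby|thereby|such that|so that)\b\s*")


def segment_claim(claim):
    """Strip numbering and claim references, then split into short segments."""
    text = NUMBERING.sub("", claim.strip())
    text = CLAIM_REFERENCE.sub("", text)
    segments = []
    for sentence in SENTENCE_END.split(text):
        for part in MARKERS.split(sentence):
            part = part.strip(" ,")
            if part:
                segments.append(part)
    return segments


claim = ("4. The device according to claim 3, wherein the body is tubular, "
         "whereby the electrode is held; the terminal is fixed.")
segments = segment_claim(claim)
```

Note that splitting on every period also breaks decimal numbers such as "3.5", so a production version would need a stricter sentence-end pattern.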
Then we perform a morphological analysis of the patent text with TreeTagger [20] for the Russian and
English languages. After that we used MaltParser [21] to perform semantic parsing (building a dependency tree).
Now let us consider the semantic analysis of the sentence: “The second end electrode and the center
electrode of the first tubular body are electrically connected to the center electrode and the first end electrode,
respectively, of the second tubular body” [22].
Table 1. Example of a dependency tree in CoNLL’09 format
Index  Word form     Lemma or stem  POS  Parent index  Stanford typed dependency
1      the           the            DT   4             det
2      second        second         JJ   4             amod
3      end           end            NN   4             nn
4      electrode     electrode      NN   16            nsubj
5      and           and            CC   4             cc
6      the           the            DT   8             det
7      center        center         NN   8             nn
8      electrode     electrode      NN   4             conj
9      of            of             IN   8             prep
10     the           the            DT   13            det
11     first         first          JJ   13            amod
12     tubular       tubular        JJ   13            amod
13     body          body           NN   9             pobj
14     are           be             VBP  16            cop
15     electrically  electrically   RB   16            advmod
16     connected     connect        VVN  0             null
Part-of-speech (POS) tags [23] are assigned to each word according to its role in the sentence: verb
(base form (VV), past participle (VVN), verb “be”, present, non-3rd person (VBP)), noun (NN), adjective
(JJ), preposition (IN), determiner (DT), adverb (RB), coordinating conjunction (CC), etc.
The Stanford typed dependencies [24] representation was designed to provide a simple description of the
grammatical relationships in a sentence: “amod” - adjectival modifier, “det” - determiner, “pobj” - object of
a preposition, “nsubj” - nominal subject, “cop” - copula, “cc” - coordination, “conj” - conjunct, “nn” - noun
compound modifier, “advmod” - adverbial modifier, “prep” - prepositional modifier, “punct” - punctuation.
In the collapsed representation, dependencies involving prepositions and conjuncts, as well as information
about the referent of relative clauses, are collapsed to get direct dependencies between content words. We
removed grammatical relations such as “punct”, “det”, “prep”, “cc”, etc. from the dependency trees. The
parent of a removed node is transferred to its child node, and the indexes of the words in the sentence are
renumbered:
nsubj(connected-16, electrode-4)
cc(electrode-4, and-5)
det(electrode-8, the-6)
nn(electrode-8, center-7)
conj(electrode-4, electrode-8)
nsubj(connected-10, electrode-3)
nsubj(connected-10, electrode-8)
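The removal and renumbering step can be sketched on the rows of Table 1. This is an illustration under stated assumptions: the function name `collapse` is hypothetical, the copula relation “cop” is added to the removed set because the renumbered indexes in Table 2 (connected-10) imply that “are” was dropped, the root row is labeled “root” as in Table 2, and the propagation of “nsubj” to conjuncts is not covered.

```python
# Rows of Table 1 as (index, word, parent index, relation).
TOKENS = [
    (1, "the", 4, "det"), (2, "second", 4, "amod"), (3, "end", 4, "nn"),
    (4, "electrode", 16, "nsubj"), (5, "and", 4, "cc"), (6, "the", 8, "det"),
    (7, "center", 8, "nn"), (8, "electrode", 4, "conj"), (9, "of", 8, "prep"),
    (10, "the", 13, "det"), (11, "first", 13, "amod"), (12, "tubular", 13, "amod"),
    (13, "body", 9, "pobj"), (14, "are", 16, "cop"),
    (15, "electrically", 16, "advmod"), (16, "connected", 0, "root"),
]

# "cop" is an assumption inferred from the renumbered indexes in Table 2.
REMOVED = {"punct", "det", "prep", "cc", "cop"}


def collapse(tokens, removed=REMOVED):
    """Drop nodes with removed relations, reattach children, renumber the rest."""
    parent = {i: p for i, _, p, _ in tokens}
    relation = {i: r for i, _, _, r in tokens}

    def surviving_parent(i):
        # Climb past removed nodes to the nearest kept ancestor (0 is ROOT).
        p = parent[i]
        while p != 0 and relation[p] in removed:
            p = parent[p]
        return p

    kept = [tok for tok in tokens if tok[3] not in removed]
    new_index = {old: new for new, (old, _, _, _) in enumerate(kept, start=1)}
    new_index[0] = 0
    return [(new_index[i], word, new_index[surviving_parent(i)], rel)
            for i, word, _, rel in kept]


collapsed = collapse(TOKENS)
```

On this input the node “body” is reattached from the removed preposition “of” to the second “electrode”, which corresponds to the row pobj(electrode-5, body-8) of Table 2.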
The authors used an approach based on MTT (Meaning-Text Theory) [25]. According to MTT, the collapsed
Stanford Dependencies (SD) are merged into the set of Deep Syntactic relations (Table 2).
Table 2. Transformation from Stanford Dependencies to Deep Syntactic Relations
Collapsed Stanford Dependencies          Deep Syntactic Structure
amod(electrode-3, second-1)              ATTR(electrode-3, second-1)
nn(electrode-3, end-2)                   ATTR(electrode-3, end-2)
nsubj(connected-10, electrode-3)         I(connected-10, electrode-3)
nn(electrode-5, center-4)                ATTR(electrode-5, center-4)
nsubj(connected-10, electrode-5)         I(connected-10, electrode-5)
amod(body-8, first-6)                    ATTR(body-8, first-6)
amod(body-8, tubular-7)                  ATTR(body-8, tubular-7)
pobj(electrode-5, body-8)                II(electrode-5, body-8)
advmod(connected-10, electrically-9)     ATTR(connected-10, electrically-9)
root (ROOT-0, connected-10)              root (ROOT-0, connected-10)
nn(electrode-12, center-11)              ATTR(electrode-12, center-11)
pobj(connected-10, electrode-12)         II(connected-10, electrode-12)
amod(electrode-15, first-13)             ATTR(electrode-15, first-13)
nn(electrode-15, end-14)                 ATTR(electrode-15, end-14)
pobj(connected-10, electrode-15)         II(connected-10, electrode-15)
advmod(connected-10, respectively-16)    ATTR(connected-10, respectively-16)
amod(body-19, second-17)                 ATTR(body-19, second-17)
amod(body-19, tubular-18)                ATTR(body-19, tubular-18)
pobj(connected-10, body-19)              II(connected-10, body-19)
According to MTT we transformed Stanford Dependencies (SD) into the following Deep Syntactic relations.
Actantial relations: the relation I covers the Subject and all its transforms, and nominal and agentive
complements: “nsubj”, “nsubjpass”, “csubj”, “csubjpass”, etc.; the relation II covers the Direct Object,
Oblique (Prepositional) Object, predicative complement, and the complement of an adjective, a preposition or
a conjunction: “dobj”, “iobj”, “pobj”, etc.; the relation III is used with a semantically trivalent
transitive verb or for an SD such as pobj_in, pobj_on, etc. (used together with prepositions).
The attributive relations (ATTR) cover all types of modifiers, determiners, quantifiers, relative clauses and
circumstantials: “amod”, “advmod”, “nn”, etc.
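The mapping described above amounts to a lookup from SD labels to Deep Syntactic relations. The sketch below covers only the labels named in the text; the function name `to_deep_relation` is hypothetical, and leaving unlisted labels (such as “root”) unchanged is an assumption based on Table 2.

```python
ACTANT_I = {"nsubj", "nsubjpass", "csubj", "csubjpass"}
ACTANT_II = {"dobj", "iobj", "pobj"}
ATTRIBUTIVE = {"amod", "advmod", "nn"}


def to_deep_relation(sd_label):
    """Map a collapsed Stanford Dependency label to a Deep Syntactic relation."""
    if sd_label in ACTANT_I:
        return "I"
    if sd_label in ACTANT_II:
        return "II"
    if sd_label.startswith("pobj_"):  # pobj_in, pobj_on, ... (with prepositions)
        return "III"
    if sd_label in ATTRIBUTIVE:
        return "ATTR"
    return sd_label  # assumption: labels not named in the text stay as they are


deep = [to_deep_relation(label) for label in ("amod", "nn", "nsubj", "pobj", "root")]
```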
At this step, we compare the SAOs (represented as semantic trees) of the application claims with the
trees from the selected subset obtained at the statistical analysis step. We re-rank the relevant patents
from the selected subset according to the similarity between the semantic trees.
In accordance with MTT, trees at the Deep Syntactic representation level show dependency relations between
terms (words) and look like networks with arrows running from predicate nodes (the “action” of an SAO) to
argument nodes (“subject” and “object”). For example, the SAO of a patent application (from the sentence “The
second end electrode and the center electrode of the first tubular body are electrically connected to the center
electrode and the first end electrode, respectively, of the second tubular body”) and the SAO of a patent (from
the sentence “The center electrode is connected to the terminal through an internal wire”) are shown
in Figure 1. At the zero level of an SAO representation are the ROOTs (“actions”), at the first level are the
actantial relations I and II (“subject” and “object” respectively), and at the second level are the
attributive relations.
After the SAOs are constructed, the patent application is compared with each patent in the database.
The application is compared with the i-th patent by comparing each j-th SAO of the
application with each k-th SAO of the i-th patent.
The first stage of SAO comparison
According to the SAO structure, the tree’s root (ROOT) is the verb (“action”). If the ROOTs of the
application and the patent do not match, the SAOs are not compared further, and the comparison moves on to
the next SAO of the patent application.
If the ROOTs of the application and the patent match, then the attributive (ATTR) structures associated
with the ROOTs (“action”) are compared.
If any terms (words) do not match, the term from the application is checked for significance. The
significance test is based on a predetermined table that contains the IDF [17] (inverse document frequency) of
terms in the documents of patent databases. If the term's IDF is above a limit value, the term is not
significant and is not taken into account in the similarity coefficient calculation.
Fig. 1. First stage of SAO comparison for application (left view) and of patent (right view).
We introduce the similarity coefficient of the ATTR structures associated with the ROOTs (“action”):
$$K_{ATTR}(T^{A}_{k}, T^{A}_{l}) = \frac{\sum_{i=1}^{N} \mathrm{MATCH}(t_1, t_2)}{\max(T^{A}_{k}, T^{A}_{l})}, \quad (1)$$
where $T^{A}_{k}$, $T^{A}_{l}$ are the semantic sub-trees (attributive (ATTR) structures associated with the
ROOTs) of the k-th sentence of an application claim and the l-th sentence of a patent claim, respectively;
$\max(T^{A}_{k}, T^{A}_{l})$ is the maximum number of ATTR structures for the application claim and the patent
claim with verification of term significance;
$\mathrm{MATCH}(t_1, t_2)$ is the MATCH function that determines the similarity of the terms $t_1$ and $t_2$ of
the compared semantic trees;
$N$ is the number of terms in the semantic tree $T^{A}_{k}$ of the application claim.
If the ROOTs of the application and the patent match, the similarity coefficient by “action” is:
$$K_{action} = 1 + K_{ATTR}(T^{A}_{k}, T^{A}_{l}). \quad (2)$$
In the application there are 2 such ATTR structures and in the patent there are none, so the maximum is 2.
The number of similar ATTR structures is 0. Without verification of term significance, the similarity
coefficient by “action” equals 1 (the ROOTs match) + 0/2. With verification of term significance (the IDF of
the term “respectively” is greater than the limit value, so we assume that the application contains only 1
ATTR structure), the similarity coefficient by “action” equals 1 (the ROOTs match) + 0/1.
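The first stage can be sketched as follows. This is an illustration rather than the authors' implementation: the function names are hypothetical, MATCH is assumed to be exact string equality, a mismatch of the ROOTs is represented by a coefficient of 0, and the insignificant terms are passed in as a ready-made set instead of being looked up in an IDF table.

```python
def attr_similarity(app_attrs, pat_attrs, insignificant=frozenset()):
    """Share of matching ATTR terms relative to the larger ATTR structure."""
    # Unmatched application terms that are insignificant are not counted.
    app = [t for t in app_attrs if t in pat_attrs or t not in insignificant]
    largest = max(len(app), len(pat_attrs))
    if largest == 0:
        return 0.0
    matches = sum(1 for t in app if t in pat_attrs)
    return matches / largest


def action_similarity(app_root, app_attrs, pat_root, pat_attrs,
                      insignificant=frozenset()):
    """1 for matching ROOTs plus the ATTR similarity; 0.0 if the ROOTs differ."""
    if app_root != pat_root:
        return 0.0
    return 1.0 + attr_similarity(app_attrs, pat_attrs, insignificant)


# Worked example from the text: two ATTR terms in the application, none in the patent.
without_check = action_similarity("connect", ["electrically", "respectively"],
                                  "connect", [])
with_check = action_similarity("connect", ["electrically", "respectively"],
                               "connect", [], insignificant={"respectively"})
```

Both calls give 1.0, matching 1 + 0/2 and 1 + 0/1 in the example; the significance check only changes the denominator here because no ATTR term matches.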
The second stage of SAO comparison
At this stage, the I actantial relations and the attributive ATTR structures associated with them are
compared. An example of comparing an application and a patent is shown in Figure 2 (sentence from the patent:
“the center electrode and the terminal electrode are electrically connected to each other via the glass seal”).
If the I actantial relations (“subject”) of the application and the patent do not match, the coefficient of
similarity by “subject” equals 0, and the comparison moves on to the next I “nodes” of the patent application.
If the I actantial relations (“subject”) of the application and the patent match, then the attributive (ATTR)
structures associated with the I “nodes” are compared.
Fig. 2. Second stage of SAO comparison for application (left view) and of patent (right view).
We introduce the similarity coefficient of the attributive (ATTR) structures associated with the I actantial
relation (“subject”):
$$K_{ATTR}(T^{I}_{k}, T^{I}_{l}) = \frac{\sum_{i=1}^{N} \mathrm{MATCH}(t_1, t_2)}{\max(T^{I}_{k}, T^{I}_{l})}, \quad (3)$$
where $T^{I}_{k}$, $T^{I}_{l}$ are the semantic sub-trees (attributive (ATTR) structures associated with the I
actant) of the k-th sentence of an application claim and the l-th sentence of a patent claim, respectively;
$\max(T^{I}_{k}, T^{I}_{l})$ is the maximum number of ATTR structures for the application claim and the patent
claim with verification of term significance;
$N$ is the number of terms in the semantic tree $T^{I}_{k}$ of the application claim.
We introduce the coefficient of similarity by “subject”:
$$K_{subject} = \frac{\sum_{i} \left( K^{i}_{I} + K^{i}_{ATTR} \right)}{\max(I_{k}, I_{l})}, \quad (4)$$
where $K^{i}_{I}$ is the similarity coefficient of the i-th I actantial relation (1 if the relations match, 0 if
they do not);
$K^{i}_{ATTR}$ is the similarity coefficient of the ATTR structures associated with the i-th I actantial
relation;
$\max(I_{k}, I_{l})$ is the maximum number of I actantial relations for the application claim and the patent
claim.
In the application there are 2 ATTR structures associated with I actantial relations and in the patent there
are also 2, so the maximum is 2. The IDF of every term is greater than the limit value. The maximum
coefficient (1.25) is given by the pair “electrode (ATTR) center, body, first, tubular” and “electrode (ATTR)
center”: 1 for the matching I relation plus 1/4 for one matching ATTR term out of a maximum of 4. Accordingly,
the second pair is “electrode (ATTR) second, end” and “electrode (ATTR) terminal”, with a similarity
coefficient equal to 1. The similarity coefficient by “subject” equals (1.25+1)/2=1.125.
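The worked example can be reproduced with a short sketch. The greedy pairing (best-scoring pair first, each node used once) is an assumption suggested by the wording of the example, not a procedure stated in the text; the function names are hypothetical and MATCH is again taken to be exact string equality.

```python
def attr_similarity(app_attrs, pat_attrs):
    """Share of matching ATTR terms relative to the larger ATTR structure."""
    largest = max(len(app_attrs), len(pat_attrs))
    if largest == 0:
        return 0.0
    return sum(1 for t in app_attrs if t in pat_attrs) / largest


def node_similarity(app_node, pat_node):
    """1 for matching actants plus their ATTR similarity; 0.0 if they differ."""
    (app_head, app_attrs), (pat_head, pat_attrs) = app_node, pat_node
    if app_head != pat_head:
        return 0.0
    return 1.0 + attr_similarity(app_attrs, pat_attrs)


def actant_similarity(app_nodes, pat_nodes):
    """Greedily pair actants by descending similarity and average over the maximum count."""
    largest = max(len(app_nodes), len(pat_nodes))
    if largest == 0:
        return 0.0
    pairs = sorted(
        ((node_similarity(a, p), i, j)
         for i, a in enumerate(app_nodes)
         for j, p in enumerate(pat_nodes)),
        key=lambda item: -item[0],
    )
    used_app, used_pat, total = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_app and j not in used_pat:
            used_app.add(i)
            used_pat.add(j)
            total += score
    return total / largest


application = [("electrode", ["second", "end"]),
               ("electrode", ["center", "body", "first", "tubular"])]
patent = [("electrode", ["center"]), ("electrode", ["terminal"])]
subject_coefficient = actant_similarity(application, patent)
```

The same function applies unchanged to the II actantial relations of the third stage.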
The third stage of SAO comparison
At this stage, the II actantial relations (“object”) and the attributive ATTR structures, associated with
them, are compared. It occurs by an algorithm similar to the algorithm used in the second stage.
We introduce the similarity coefficient of the attributive (ATTR) structures associated with the II
actantial relation (“object”):
$$K_{ATTR}(T^{II}_{k}, T^{II}_{l}) = \frac{\sum_{i=1}^{N} \mathrm{MATCH}(t_1, t_2)}{\max(T^{II}_{k}, T^{II}_{l})}, \quad (5)$$
where $T^{II}_{k}$, $T^{II}_{l}$ are the semantic sub-trees (attributive (ATTR) structures associated with the
II actant) of the k-th sentence of an application claim and the l-th sentence of a patent claim, respectively;
$\max(T^{II}_{k}, T^{II}_{l})$ is the maximum number of ATTR structures for the application claim and the
patent claim with verification of term significance;
$N$ is the number of terms in the semantic tree $T^{II}_{k}$ of the application claim.
We introduce the coefficient of similarity by “object”:
$$K_{object} = \frac{\sum_{i} \left( K^{i}_{II} + K^{i}_{ATTR} \right)}{\max(II_{k}, II_{l})}, \quad (6)$$
where $K^{i}_{II}$ is the similarity coefficient of the i-th II actantial relation (1 if the relations match, 0
if they do not);
$K^{i}_{ATTR}$ is the similarity coefficient of the ATTR structures associated with the i-th II actantial
relation;
$\max(II_{k}, II_{l})$ is the maximum number of II actantial relations for the application claim and the patent
claim.
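The similarity coefficients of an SAO are summed over the levels, and the total coefficient of similarity
between the application and the patent is the sum of the tree similarity coefficients:
$$K = K_{action} + K_{subject} + K_{object}, \quad (7)$$
where $K_{action}$, $K_{subject}$ and $K_{object}$ are the similarity coefficients by “action”, “subject” and
“object”.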
The experiments were performed on a multiprocessor computer system with distributed memory
(cluster) of the Volgograd State Technical University. The cluster entered the 22nd edition of the Top-50
rating of Russian supercomputers.
The following software was installed on the nodes of the cluster:
- Apache Spark - open-source cluster-computing framework, engine for large-scale data processing;
- Library for software implementation of the LDA method – Apache Spark MLlib;
- PostgreSQL.
Statistical and semantic portraits were formed for 990,000 Russian- and English-language patents and
stored in the Document Storage on the basis of the HDFS file system.
The patent examination is implemented as a MapReduce task (Figure 3). Initially, the input data are stored in
the “/input” folder, which contains files with the SAOs extracted from patents. The file name is the patent
number, and all SAOs from one patent are stored in one file. The first stage, an “only map” task, divides each
document into pairs (patent number, SAO); the result is stored in the temporary folder “/output”. This folder
serves as the input data for the second stage. The Mapper (Compare) receives as input a cached file with the
SAOs from the patent application and compares each of these SAOs with the SAOs from the patents (folder
“/output”). As a result we get pairs (patent number, coefficient of similarity), which are stored in the
“/output1” folder; from there they are read by the standard hdfs command and sorted by the avrg (k) value.
Fig. 3. MapReduce task of patent examination.
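The data flow of the two stages can be imitated in plain Python without a cluster. This is only a sketch: the patent numbers and SAO triples are invented, `toy_similarity` is a hypothetical stand-in for the SAO comparison described above, and reading “avrg (k)” as the per-patent average of the coefficients, sorted in descending order, is an assumption.

```python
from collections import defaultdict


def map_split(files):
    """Stage 1 ("only map"): emit one (patent number, SAO) pair per stored SAO."""
    for patent_number, saos in files.items():
        for sao in saos:
            yield patent_number, sao


def map_compare(pairs, application_saos, similarity):
    """Stage 2: compare every application SAO with each patent SAO."""
    for patent_number, patent_sao in pairs:
        for app_sao in application_saos:
            yield patent_number, similarity(app_sao, patent_sao)


def rank(scored_pairs):
    """Average the coefficients per patent and sort in descending order."""
    by_patent = defaultdict(list)
    for patent_number, k in scored_pairs:
        by_patent[patent_number].append(k)
    averages = {p: sum(ks) / len(ks) for p, ks in by_patent.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def toy_similarity(app_sao, patent_sao):
    """Hypothetical stand-in: count the SAO positions that match exactly."""
    return float(sum(a == b for a, b in zip(app_sao, patent_sao)))


files = {  # stands in for the "/input" folder: file name -> SAOs of that patent
    "RU1": [("electrode", "connect", "terminal")],
    "US2": [("wire", "heat", "body"), ("electrode", "connect", "body")],
}
application = [("electrode", "connect", "terminal")]
ranking = rank(map_compare(map_split(files), application, toy_similarity))
```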
For the test set we specified the root folder “/sample”. It contains the folders “/claims”, “/morph”,
“/semantic” and “/sao”, which hold the patent application, the morphological portrait of the patent
application, the semantic portrait and the extracted SAOs, respectively. At the root of the folder “/sample”
is stored a file with meta-data of all
We chose Recall for the top 5, 25 and 100 most relevant retrieved patents as the criterion of the
effectiveness of the semantic analysis. Table 3 gives the average Recall value over 10 tests. To test the
software we imitate the work of an expert. From 10 random patents we extract the citation lists; for the cited
patents we also extract their citation lists, and so on up to the fourth nesting level. As a result, we obtain
a test set of approximately 1,000 patents. For each of the 10 selected patents, the patents from this citation
list are considered relevant.
Table 3. Semantic analysis with and without verification of term’s significance
Feature                                       Recall@5  Recall@25  Recall@100
with verification of term’s significance      82        84         92
without verification of term’s significance   78        80         86
Verifying term significance increases Recall. This is due to a more accurate ranking by SAO similarity,
since insignificant, commonly used words do not affect the patent ranking.
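For reference, Recall over the top k results can be computed as in the sketch below. It uses the standard definition (relevant patents found in the top k divided by all relevant patents); the text does not spell out the exact formula behind Table 3, so this is an assumption, and the patent identifiers are invented.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant patents that appear among the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for patent_id in ranked[:k] if patent_id in relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average Recall@k over several (ranked list, relevant set) test runs."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


ranked = ["p1", "p7", "p3", "p9", "p2"]
relevant = {"p1", "p2", "p4"}
recall_top2 = recall_at_k(ranked, relevant, 2)
recall_top5 = recall_at_k(ranked, relevant, 5)
```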
We developed software for patent examination based on “subject-action-object” (SAO) triples extracted
from patent texts. At the semantic analysis step we applied a new method for building a semantic network
on the basis of Stanford Dependencies and Meaning-Text Theory. At the semantic similarity calculation step
we compare the SAOs of the application and patent claims. The developed software prototype for the patent
examination task significantly reduced search time and increased such criteria of search effectiveness as
