PRIOR ART CANDIDATE SEARCH ON BASE OF STATISTICAL AND SEMANTIC PATENT ANALYSIS

Dmitriy M. Korobkin, Sergey S. Fomenkov, Alla G. Kravets, Sergey. G. Kolesnikov
Volgograd State Technical University
Russia, Volgograd, Lenin av., 28, dkorobkin80@mail.ru
ABSTRACT
In this paper the authors propose a methodology for prior-art patent search that consists of statistical and semantic analysis of patent documents and calculation of semantic similarity between an application and patents on the basis of Subject-Action-Object (SAO) triples. The paper describes statistical analysis based on the LDA method and the MapReduce paradigm. At the semantic analysis step, the authors apply a new method for building a semantic representation of SAOs based on Meaning-Text Theory. At the semantic similarity calculation step, we compare the SAOs from the application and patent claims. We developed software for the patent examination task that is designed to reduce the time an expert spends on prior-art candidate search. This research was financially supported by the Russian Foundation for Basic Research (grants No. 15-07-09142 A, No. 15-07-06254 A, No. 16-07-00534 A).
KEYWORDS
Prior-art patent search, patent examination, LDA, semantic analysis, natural language processing, SAO, big data.
1. INTRODUCTION
The number of patent applications increases from year to year: around 2.9 million patent applications were filed worldwide in 2015, up 7.8% from 2014. The growing flow of applications and the world set of more than 20 million granted patents (from 1980 to 2015) increase the time patent examiners have to spend examining incoming applications. During the examination procedure an examiner sometimes has to make hundreds of search queries and process thousands of existing patents manually in order to decide whether to approve the application or to reject it. The increasing workload of patent offices has led to the need for new approaches to prior-art patent retrieval based on statistical and semantic methods of natural language processing.
Many researchers have addressed the patent prior-art search task. Magdy used an approach based on unigrams and bigrams [1]; Verma's approach is based on keyphrase and citation extraction [2]; Mahdabi used a method based on a time-aware random walk on a weighted network of patent citations [3]; Xue's approach considers the actual query as the basic unit and thus captures important query-level dependencies between words and phrases [4]; D'hondt compared flat classification with a two-step hierarchical system that models the IPC hierarchy [5]; Bouadjenek used a query with a full patent description in conjunction with generic query reduction methods [6]; Kim proposed a method to suggest diverse queries that cover multiple aspects of the query patent [7]; Ferraro's approach segments the patent claim with a rule-based approach and trains a conditional random field to segment the components into clauses [8]; Andersson addressed three different relation extraction applications: acronym extraction, hyponymy extraction and factoid entity relation extraction [9]; Park uses SAO (subject-action-object) triples for identifying patent infringement [10]; Yufeng combines the SAO structure with the VSM model to calculate patent similarity [11]; Choi uses an SAO-based text mining approach to build a technology tree [12]; the authors of this paper proposed a method of technical function discovery in patent databases [13,14].
In this paper we propose a novel approach that combines statistical and semantic features to increase the accuracy of prior-art search.
2. STATISTICAL ANALYSIS
The basic idea of LDA [15] is that patents are represented as random mixtures of latent topics, where each topic is characterized by a distribution over the words of the document collection. The LDA model can be used for statistical analysis of a patent database and for distributing patents over unnamed topics.
The first stage of the statistical analysis is tokenization of the patent text; the tokens are individual words or N-grams from the patent text. After tokenization, lemmatization is necessary, i.e. converting the extracted words to their base form for the most accurate construction of the term-document matrix. The Big Data processing framework Apache Spark and its MLlib machine learning library allow us to build a dictionary of all words in the document set and a term-document matrix.
The statistical analysis software [16] performs patent search by different methods and evaluates search efficiency. From the testing results we conclude that the most effective configuration is: storing the vectors in HDFS, comparing the topic distribution vectors of patents with cosine similarity, and patent tokenization with synonym replacement. Tokenization with synonym replacement includes removal of stop-words and, beyond that, maps all synonyms to one word (a base alias). This approach yields the most accurate term-document matrix and LDA model but is slower than plain tokenization.
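The tokenization and term-document bookkeeping described above can be sketched in plain Python (the stop-word list and the synonym-to-base-alias table below are illustrative stand-ins, not the ones used in the paper; the real pipeline runs on Apache Spark MLlib):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are"}
# synonym -> base alias mapping (illustrative values)
SYNONYMS = {"apparatus": "device", "unit": "device"}

def tokenize(text):
    """Lowercase, split into word tokens, drop stop-words,
    and map synonyms to a single base alias."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] is the count
    of vocabulary term t in document d."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for w, n in Counter(doc).items():
            row[index[w]] = n
        matrix.append(row)
    return vocab, matrix

docs = ["The device of the apparatus", "A unit and a device"]
vocab, m = term_document_matrix(docs)
```

Because both synonyms collapse to the base alias "device", the two example documents produce identical rows of the term-document matrix, which is exactly the effect described above.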
3. SEMANTIC ANALYSIS
At this step we construct dependency trees for the application and patent claims [18]. The text of patent claims has one feature that makes effective use of existing dependency-tree construction tools difficult: the claims are written as one sentence, which sometimes contains hundreds of words. To solve this problem, an algorithm for segmentation of complex sentences has been developed [19]. Sentences are segmented on the basis of transitional phrases of claims and special “marker” phrases such as “, wherein”; “, said”; “, and”; “; and”; “, thereby”; “thereby”; “such that”; “so that”; “wherein”; “whereby”; “where”; “when”; “while”; etc. To remove numbering like “A.”, “a.”, “1.” or “1)” we use the regular expression «^(\d{1,4}|[a-zA-Z]{1,2})(\.|\))\s»; to remove references like «4. The device according to claim 3...» we use the regular expression «^.+(of|in|to) claim \d+(, )?». Sentences are separated by punctuation («:», «;», «.», «!», «?») using the regular expression «(\.|!|\?|:|;)\s?».
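Using the exact regular expressions quoted above, the segmentation step can be sketched as follows (the function names are ours):

```python
import re

NUMBERING = re.compile(r"^(\d{1,4}|[a-zA-Z]{1,2})(\.|\))\s")
CLAIM_REF = re.compile(r"^.+(of|in|to) claim \d+(, )?")
SPLITTER = re.compile(r"(\.|!|\?|:|;)\s?")

def clean_claim(line):
    """Strip leading numbering like '4. ' or 'a) ' and references
    like 'The device according to claim 3, '."""
    line = NUMBERING.sub("", line)
    line = CLAIM_REF.sub("", line)
    return line

def split_sentences(text):
    """Split on sentence-final punctuation.  re.split keeps the
    captured delimiters, so every second element is a text chunk."""
    parts = SPLITTER.split(text)
    return [p.strip() for p in parts[::2] if p.strip()]
```

For example, clean_claim("4. The device according to claim 3, comprising a seal") leaves only the claim body, and split_sentences breaks a segmented claim on «:», «;», «.», «!», «?».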
Then we perform morphological analysis of the patent text with TreeTagger [20] for Russian and English. After that we use MaltParser [21] to perform dependency parsing (building a dependency tree).
Now let us consider the semantic analysis of the sentence: “The second end electrode and the center
electrode of the first tubular body are electrically connected to the center electrode and the first end electrode,
respectively, of the second tubular body” [22].
Table 1. Example of a dependency tree in CoNLL'09 format

Index  Word form     Lemma         POS  Head  Stanford typed dependency
1      the           the           DT   4     det
2      second        second        JJ   4     amod
3      end           end           NN   4     nn
4      electrode     electrode     NN   16    nsubj
5      and           and           CC   4     cc
6      the           the           DT   8     det
7      center        center        NN   8     nn
8      electrode     electrode     NN   4     conj
9      of            of            IN   8     prep
10     the           the           DT   13    det
11     first         first         JJ   13    amod
12     tubular       tubular       JJ   13    amod
13     body          body          NN   9     pobj
14     are           be            VBP  16    cop
15     electrically  electrically  RB   16    advmod
16     connected     connect       VVN  0     null
Part-of-speech (POS) tags [23] are assigned to a single word according to its role in the sentence: the verb
(base form (VV), past participle (VVN), verb “be”, present, non-3rd person (VBP)), the noun (NN), adjective
(JJ), preposition (IN), determiner (DT), adverb (RB), coordinating conjunction (CC), etc.
The Stanford typed dependencies [24] representation was designed to provide a simple description of the
grammatical relationships in a sentence: “amod” - adjectival modifier, “det” – determiner, “pobj” - object of
a preposition, “nsubj” - nominal subject, “cop” - copula, “cc” - coordination, “conj” - conjunct, “nn” - noun
compound modifier, “advmod” - adverb modifier, “prep” - prepositional modifier, “punct” - punctuation.
In the collapsed representation, dependencies involving prepositions and conjuncts, as well as information about the referent of relative clauses, are collapsed to obtain direct dependencies between content words. We removed from the dependency trees grammatical relations such as “punct”, “det”, “prep”, “cc”, etc. The parent of a removed node is transferred to the child node, and the indexes of words in the sentence are renumbered:
nsubj(connected-16, electrode-4)
cc(electrode-4, and-5)
det(electrode-8, the-6)
nn(electrode-8, center-7)
conj(electrode-4, electrode-8)
become
nsubj(connected-10, electrode-3)
nsubj(connected-10, electrode-5)
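A minimal sketch of this collapsing step, assuming the tree is held as (relation, head index, dependent index) triples and that, consistently with the renumbering used in Table 2, the copula is dropped along with “punct”, “det”, “prep” and “cc”, while a conjunct inherits its governor's relation:

```python
# Table 1 as (relation, head index, dependent index) triples
DEPS = [("det", 4, 1), ("amod", 4, 2), ("nn", 4, 3), ("nsubj", 16, 4),
        ("cc", 4, 5), ("det", 8, 6), ("nn", 8, 7), ("conj", 4, 8),
        ("prep", 8, 9), ("det", 13, 10), ("amod", 13, 11),
        ("amod", 13, 12), ("pobj", 9, 13), ("cop", 16, 14),
        ("advmod", 16, 15), ("root", 0, 16)]

DROP = {"punct", "det", "prep", "cc", "cop"}

def collapse(deps):
    """Remove function-word nodes, re-attach their children to the
    removed node's head, let conjuncts inherit their governor's
    relation, and renumber the surviving words consecutively."""
    removed = {d for rel, h, d in deps if rel in DROP}
    parent = {d: h for rel, h, d in deps}
    relation = {d: rel for rel, h, d in deps}

    def climb(h):                       # nearest surviving ancestor
        while h in removed:
            h = parent[h]
        return h

    kept = []
    for rel, h, d in deps:
        if rel in DROP or d in removed:
            continue
        h = climb(h)
        if rel == "conj":               # conj(electrode-4, electrode-8)
            rel, h = relation[h], climb(parent[h])  # -> nsubj(connected-16, electrode-8)
        kept.append((rel, h, d))

    survivors = sorted({i for _, h, d in kept for i in (h, d)} - {0})
    new = {old: i + 1 for i, old in enumerate(survivors)}
    new[0] = 0                          # ROOT keeps index 0
    return [(rel, new[h], new[d]) for rel, h, d in kept]

collapsed = collapse(DEPS)
```

On the Table 1 example this yields nsubj(connected-10, electrode-3), nsubj(connected-10, electrode-5) and pobj(electrode-5, body-8), matching the indices used in Table 2.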
The authors used an approach based on Meaning-Text Theory (MTT) [25]. According to MTT, collapsed Stanford Dependencies (SD) are merged into a set of deep-syntactic relations (Table 2).
Table 2. Transformation from Stanford Dependencies to Deep Syntactic Relations
Collapsed Stanford Dependencies Deep Syntactic Structure
amod(electrode-3, second-1) ATTR(electrode-3, second-1)
nn(electrode-3, end-2) ATTR(electrode-3, end-2)
nsubj(connected-10, electrode-3) I(connected-10, electrode-3)
nn(electrode-5, center-4) ATTR(electrode-5, center-4)
nsubj(connected-10, electrode-5) I(connected-10, electrode-5)
amod(body-8, first-6) ATTR(body-8, first-6)
amod(body-8, tubular-7) ATTR(body-8, tubular-7)
pobj(electrode-5, body-8) II(electrode-5, body-8)
advmod(connected-10, electrically-9) ATTR(connected-10, electrically-9)
root (ROOT-0, connected-10) root (ROOT-0, connected-10)
nn(electrode-12, center-11) ATTR(electrode-12, center-11)
pobj(connected-10, electrode-12) II(connected-10, electrode-12)
amod(electrode-15, first-13) ATTR(electrode-15, first-13)
nn(electrode-15, end-14) ATTR(electrode-15, end-14)
pobj(connected-10, electrode-15) II(connected-10, electrode-15)
advmod(connected-10, respectively-16) ATTR(connected-10, respectively-16)
amod(body-19, second-17) ATTR(body-19, second-17)
amod(body-19, tubular-18) ATTR(body-19, tubular-18)
pobj(connected-10, body-19) II(connected-10, body-19)
According to MTT we transformed Stanford Dependencies (SD) into deep-syntactic relations (DSyntRel):
Actantial relations: relation I covers the subject and all its transforms, and nominal and agentive complements: “nsubj”, “nsubjpass”, “csubj”, “csubjpass”, etc.; relation II covers the direct object, the oblique (prepositional) object, the predicative complement, and complements of an adjective, a preposition or a conjunction: “dobj”, “iobj”, “pobj”, etc.; relation III is used with semantically trivalent transitive verbs, or with SDs such as pobj_in, pobj_on, etc. (used together with prepositions).
Attributive relations (ATTR) cover all types of modifiers, determiners, quantifiers, relative clauses and circumstantials: “amod”, “advmod”, “nn”, etc.
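The SD-to-DSyntRel rules above and the pairs in Table 2 can be captured in a small lookup table (the fallback of unlisted relations to ATTR is our reading of the text, since ATTR is said to cover all remaining modifier types):

```python
# Collapsed Stanford Dependency -> MTT deep-syntactic relation
SD_TO_DSYNT = {
    "nsubj": "I", "nsubjpass": "I", "csubj": "I", "csubjpass": "I",
    "dobj": "II", "iobj": "II", "pobj": "II",
    "amod": "ATTR", "advmod": "ATTR", "nn": "ATTR",
}

def to_dsynt(rel):
    """Map a collapsed SD label to its deep-syntactic relation."""
    if rel == "root":
        return "root"
    if rel.startswith("pobj_"):   # pobj_in, pobj_on, ... -> relation III
        return "III"
    return SD_TO_DSYNT.get(rel, "ATTR")  # remaining modifiers -> ATTR
```

Applied to the collapsed tree of the example sentence, this reproduces the right-hand column of Table 2 (e.g. nsubj becomes I, pobj becomes II, amod and nn become ATTR).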
4. SAO SIMILARITY CALCULATION BETWEEN PATENT
APPLICATION AND PATENTS
At this step, we compare the SAOs (represented as semantic trees) of the application claims with the trees from the subset selected at the statistical analysis step. We re-rank the relevant patents from the selected subset according to the similarities between the semantic trees.
In accordance with MTT, trees at the deep-syntactic representation level show dependency relations between terms (words) and look like networks with arrows running from predicate nodes (the “action” of a SAO) to argument nodes (the “subject” and “object”). For example, the SAO of a patent application (from the sentence “The second end electrode and the center electrode of the first tubular body are electrically connected to the center electrode and the first end electrode, respectively, of the second tubular body”) and the SAO of a patent (from the sentence “The center electrode is connected to the terminal through an internal wire”) are presented in Figure 1. At the null level of a SAO representation are the ROOTs (“actions”), at the first level the actantial relations I and II (“subject” and “object” respectively), and at the second level the attributive relations.
After the SAO construction stage, the patent application is compared with each patent in the database. The application is compared with the i-th patent by comparing each j-th SAO of the application with each k-th SAO of the i-th patent.
The first stage of SAO comparison
According to a SAO structure the tree’s root (ROOT) is the verb (“action”). If the ROOTs of the
application and the patent do not match, further comparison of the SAOs is not performed and comparison is
made for the next SAO of the patent application.
If the ROOTs of the application and the patent match, then the attributive (ATTR) structures associated
with the ROOTs (“action”) are compared.
If any terms (words) do not match, the term from the application is checked for significance. The significance test is based on a predetermined table that contains the IDF [17] (inverse document frequency) of terms in the documents of the patent database. If the term's IDF is above a limit value, the term is considered insignificant and is not taken into account in the similarity coefficient calculation.
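A sketch of the significance table and test, keeping the paper's convention that a table value above the limit marks a term as insignificant (the two-document corpus and the limit below are illustrative, not from the paper):

```python
import math

def idf_table(tokenized_docs):
    """Inverse document frequency for every term: log(N / df)."""
    n = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(n / d) for t, d in df.items()}

def is_significant(term, idf, limit):
    """Paper's rule: a term whose table value exceeds the limit is
    treated as insignificant and skipped in the similarity sums."""
    return idf.get(term, 0.0) <= limit

idf_vals = idf_table([["electrode", "seal"], ["electrode", "wire"]])
```

Here "electrode" occurs in both documents (IDF 0) while "seal" occurs in one (IDF log 2), so with a limit of 0.5 only "electrode" passes the test.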
Fig. 1. First stage of SAO comparison for the application (left) and the patent (right).
We introduce the similarity coefficient of the ATTR structures, associated with the ROOTs (“action”):
K_{ATTR}^{A}(TA_k, TA_l) = \frac{\sum_{i=1}^{N} S(t_1, t_2)}{\max_{ATTR}(TA_k, TA_l)} ,   (1)
where TA_k, TA_l are the semantic sub-trees (the attributive (ATTR) structures associated with the ROOTs) of the k-th sentence of the application claim and the l-th sentence of the patent claim accordingly; \max_{ATTR}(TA_k, TA_l) is the maximum number of ATTR structures for the application claim and the patent claim, with verification of term significance; S(t_1, t_2) is the MATCH function that determines the similarity of terms t_1 and t_2 of the compared semantic trees; N is the number of terms in the semantic tree TA_k of the application claim.
If the ROOTs of the application and the patent match, the similarity coefficient by “action” is:
K^{A} = 1 + K_{ATTR}^{A}(TA_k, TA_l) ,   (2)
In the application there are 2 such ATTR structures and in the patent 0, so the maximum is 2; the number of similar ATTR structures is 0. Without verification of term significance, the similarity coefficient by “action” equals 1 (the ROOTs match) + 0/2. With verification of term significance (the IDF of the term “respectively” is above the limit value, so we assume the application has only 1 ATTR structure), the similarity coefficient by “action” equals 1 (the ROOTs match) + 0/1.
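Equations (1)–(2) and the worked example above can be sketched as follows (the significance predicate is passed in as a function; the lemma "connect" is illustrative):

```python
def action_similarity(root_a, attrs_a, root_p, attrs_p, significant):
    """K^A per Eqs. (1)-(2): 1 for matching ROOTs plus the matched
    share of significant ATTR terms; None when the ROOTs differ and
    the comparison is abandoned."""
    if root_a != root_p:
        return None
    attrs_a = [t for t in attrs_a if significant(t)]
    attrs_p = [t for t in attrs_p if significant(t)]
    denom = max(len(attrs_a), len(attrs_p))
    if denom == 0:
        return 1.0                      # ROOTs match, nothing else to compare
    matched = sum(1 for t in attrs_a if t in attrs_p)
    return 1.0 + matched / denom

sig = lambda t: t != "respectively"     # "respectively" fails the IDF test
k_a = action_similarity("connect",
                        ["electrically", "respectively"],
                        "connect", [], sig)
```

Both with and without the significance filter, the example gives 1 + 0/1 and 1 + 0/2 respectively, i.e. K^A = 1.0 in either case, as computed in the text.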
The second stage of SAO comparison
At this stage, the I actantial relations and the attributive ATTR structures associated with them are compared. An example of the comparison of an application and a patent is shown in Figure 2 (sentence from the patent: “the center electrode and the terminal electrode are electrically connected to each other via the glass seal”).
If the I actantial relations (“subject”) of the application and the patent do not match, the similarity coefficient by “subject” equals 0 and the comparison proceeds to the next I “nodes” of the patent application.
If the I actantial relations (“subject”) of the application and the patent match, then the attributive (ATTR) structures associated with the I “nodes” are compared.
Fig. 2. Second stage of SAO comparison for the application (left) and the patent (right).
We introduce the similarity coefficient of the attributive (ATTR) structures, associated with the I actantial
relation (“subject”):
K_{ATTR}^{I}(TI_k, TI_l) = \frac{\sum_{i=1}^{N} S(t_1, t_2)}{\max_{ATTR}(TI_k, TI_l)} ,   (3)
where TI_k, TI_l are the semantic sub-trees (the attributive (ATTR) structures associated with the I actant) of the k-th sentence of the application claim and the l-th sentence of the patent claim accordingly; \max_{ATTR}(TI_k, TI_l) is the maximum number of ATTR structures for the application claim and the patent claim, with verification of term significance; N is the number of terms in the semantic tree TI_k of the application claim.
We introduce the coefficient of similarity by “subject”:
K^{S} = \frac{\sum_{i=1}^{N_I} \left( K_{M_i}^{I} + K_{ATTR_i}^{I} \right)}{\max(I_k, I_l)} ,   (4)
where K_{M_i}^{I} is the similarity coefficient of the i-th I actantial relation (K_{M_i}^{I} = 1 if the relations match, K_{M_i}^{I} = 0 otherwise); K_{ATTR_i}^{I} is the similarity coefficient of the ATTR structures associated with the i-th I actantial relation; \max(I_k, I_l) is the maximum number of I actantial relations for the application claim and the patent claim; N_I is the number of I actantial relations of the application claim.
In the application there are 2 such ATTR structures associated with I actantial relations, and in the patent 2, so the maximum is 2. The IDF of all terms is below the limit value, so all terms are significant. The maximum coefficient (1.25) is given by the pair “electrode (ATTR) center, body, first, tubular” – “electrode (ATTR) center”. Accordingly, the second pair is “electrode (ATTR) second, end” – “electrode (ATTR) terminal” with a similarity coefficient equal to 1. The similarity coefficient by “subject” equals (1.25+1)/2 = 1.125.
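A sketch of Eq. (4) that reproduces the worked example; the greedy strategy of pairing each application actant with its best-scoring patent actant is our assumption about how the pairs above were chosen:

```python
def subject_similarity(subjects_a, subjects_p, significant):
    """K^S per Eq. (4).  subjects_*: list of (lemma, [ATTR terms]).
    For each I actant of the application, take the best-scoring
    patent actant: 1 for a lemma match plus the matched share of
    significant ATTR terms; normalise by the larger actant count."""
    denom = max(len(subjects_a), len(subjects_p))
    if denom == 0:
        return 0.0
    total = 0.0
    for lemma_a, attrs_a in subjects_a:
        best = 0.0
        for lemma_p, attrs_p in subjects_p:
            if lemma_a != lemma_p:
                continue                      # K_M = 0, nothing to add
            aa = [t for t in attrs_a if significant(t)]
            pp = [t for t in attrs_p if significant(t)]
            m = max(len(aa), len(pp))
            attr = (sum(1 for t in aa if t in pp) / m) if m else 0.0
            best = max(best, 1.0 + attr)      # K_M + K_ATTR
        total += best
    return total / denom

app = [("electrode", ["center", "body", "first", "tubular"]),
       ("electrode", ["second", "end"])]
pat = [("electrode", ["center"]), ("electrode", ["terminal"])]
k_s = subject_similarity(app, pat, lambda t: True)
```

The first application actant scores 1 + 1/4 = 1.25 against “electrode (ATTR) center”, the second scores 1 + 0 = 1, giving K^S = (1.25+1)/2 = 1.125 as in the text.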
The third stage of SAO comparison
At this stage, the II actantial relations (“object”) and the attributive ATTR structures associated with them are compared, following an algorithm similar to that used in the second stage.
We introduce the similarity coefficient of the attributive (ATTR) structures, associated with the II
actantial relation (“object”):
K_{ATTR}^{II}(TII_k, TII_l) = \frac{\sum_{i=1}^{N} S(t_1, t_2)}{\max_{ATTR}(TII_k, TII_l)} ,   (5)
where TII_k, TII_l are the semantic sub-trees (the attributive (ATTR) structures associated with the II actant) of the k-th sentence of the application claim and the l-th sentence of the patent claim accordingly; \max_{ATTR}(TII_k, TII_l) is the maximum number of ATTR structures for the application claim and the patent claim, with verification of term significance; N is the number of terms in the semantic tree TII_k of the application claim.
We introduce the coefficient of similarity by “object”:
K^{O} = \frac{\sum_{i=1}^{N_{II}} \left( K_{M_i}^{II} + K_{ATTR_i}^{II} \right)}{\max(II_k, II_l)} ,   (6)
where K_{M_i}^{II} is the similarity coefficient of the i-th II actantial relation (K_{M_i}^{II} = 1 if the relations match, K_{M_i}^{II} = 0 otherwise); K_{ATTR_i}^{II} is the similarity coefficient of the ATTR structures associated with the i-th II actantial relation; \max(II_k, II_l) is the maximum number of II actantial relations for the application claim and the patent claim; N_{II} is the number of II actantial relations of the application claim.
The similarity coefficient of SAO is summarized for each level and the total coefficient of the application
and patent similarity is the sum of trees similarity coefficients:
K_{SAO} = \frac{K^{A} + K^{S} + K^{O}}{6} ,   (7)
5. EXPERIMENTS AND RESULTS
The experiments were performed on a multiprocessor computer system with distributed memory (cluster) at Volgograd State Technical University. The cluster entered the 22nd edition of the Top-50 rating of Russian supercomputers.
The following software was installed on the nodes of the cluster:
- Apache Spark - open-source cluster-computing framework, engine for large-scale data processing;
- Library for software implementation of the LDA method – Apache Spark MLlib;
- PostgreSQL.
Statistical and semantic portraits were formed for 990,000 Russian- and English-language patents and
stored in the Document Storage on the basis of the HDFS file system.
The patent examination is implemented as a MapReduce task (Figure 3). Initially, the input data are stored in the “/input” folder, which contains files with the SAOs extracted from patents. The file name is the patent number, and all SAOs of a patent are stored in one file. The first stage, an “only map” task, divides each document into (patent number, SAO) pairs; the result is stored in a temporary folder “/output”. This folder serves as input for the second stage. The Mapper (Compare) receives as input a cached file with the SAOs of the patent application and compares each of them with the SAOs from the patents (folder “/output”). As a result we get (patent number, similarity coefficient) pairs; the result is stored in the “/output1” folder, from where it is read by the standard hdfs command and sorted by the avrg (k) value.
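The two stages can be imitated in plain Python, with pure functions standing in for the Hadoop mappers (the similarity callback would be the K_SAO computation of Eq. (7); averaging the per-SAO scores per patent is our reading of the avrg (k) sort key):

```python
def map_stage(files):
    """First 'only map' stage: split each patent file (one SAO per
    line, file name = patent number) into (patent number, SAO) pairs."""
    pairs = []
    for patent_number, text in files.items():
        for sao in text.splitlines():
            if sao.strip():
                pairs.append((patent_number, sao.strip()))
    return pairs

def compare_stage(pairs, application_saos, similarity):
    """Second stage: score every patent SAO against each SAO of the
    cached application, keep the best score per patent SAO, and rank
    patents by their average score."""
    scores = {}
    for patent_number, sao in pairs:
        best = max(similarity(app, sao) for app in application_saos)
        scores.setdefault(patent_number, []).append(best)
    return sorted(((p, sum(v) / len(v)) for p, v in scores.items()),
                  key=lambda kv: kv[1], reverse=True)

files = {"US1": "sao_a\nsao_b", "US2": "sao_c"}
ranked = compare_stage(map_stage(files), ["sao_a"],
                       lambda a, b: 1.0 if a == b else 0.0)
```

With the toy exact-match similarity, patent US1 (one of its two SAOs matches) ranks above US2.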
Fig. 3. MapReduce task of patent examination.
For the test set we specified the root folder “/sample”. It contains the folders “/claims”, “/morph”, “/semantic” and “/sao”, holding, respectively, the patent application, its morphological portrait, its semantic portrait and the extracted SAOs. A file with the metadata of all patents is stored at the root of the “/sample” folder.
We chose Recall over the sets of the top 5, 25 and 100 most relevant retrieved patents as the criterion of semantic analysis effectiveness; Table 3 reports the average Recall over 10 tests. For the software test we imitate an expert's work: from 10 random patents we extract the citation lists, for the cited patents we also extract their citation lists, and so on up to the fourth nesting level. As a result we obtain a test set of approximately 1,000 patents. For each of the 10 selected patents, the patents from its citation list are considered relevant.
Table 3. Semantic analysis with and without verification of term significance

Feature                                     Recall@5  Recall@25  Recall@100
with verification of term significance         82        84         92
without verification of term significance      78        80         86
Verification of term significance increases Recall. This is due to more accurate ranking of SAO similarity, since insignificant, commonly used words do not affect the patent ranking.
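The evaluation criterion can be sketched as follows (Recall reported in percent, as in Table 3; the patent identifiers are illustrative):

```python
def recall_at_k(ranked, relevant, k):
    """Recall@k: the share (in percent) of the relevant (cited)
    patents that appear among the top-k retrieved patents."""
    retrieved = set(ranked[:k])
    return 100.0 * len(retrieved & set(relevant)) / len(relevant)

ranked = ["p1", "p2", "p3", "p4"]       # system output, best first
relevant = {"p1", "p3", "p8", "p9"}     # patents from the citation list
```

Here Recall@2 is 25% (only p1 of the four relevant patents is in the top 2) and Recall@4 is 50%.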
6. CONCLUSION
We developed software for patent examination based on “subject-action-object” (SAO) triples extracted from patent texts. At the semantic analysis step we applied a new method for building a semantic network based on Stanford Dependencies and Meaning-Text Theory. At the semantic similarity calculation step we compare the SAOs of the application and patent claims. The developed software prototype for the patent examination task significantly reduced search time and increased search effectiveness criteria such as Recall.
REFERENCES
1. W. Magdy and G. J. F. Jones, 2010. Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent
Search Task. Workshop of the Cross-Language Evaluation Forum, LABs and Workshops, Notebook Papers.
2. Manisha Verma, and Vasudeva Varma, 2011. Exploring Keyphrase Extraction and IPC Classification Vectors for Prior
Art Search. CLEF Notebook Papers/Labs/Workshop.
3. P. Mahdabi, F. Crestani, 2014. Query-Driven Mining of Citation Networks for Patent Citation Retrieval and
Recommendation, In ACM International Conference on Information and Knowledge Management (CIKM).
4. Xiaobing Xue and W. Bruce Croft, 2013. Modeling reformulation using query distributions. Journal ACM
Transactions on Information Systems. Volume 31, Issue 2.ACM New York, NY, USA
5. E D’hondt, S Verberne, N Oostdijk, L Boves, 2017. Patent Classification on Subgroup Level Using Balanced Winnow.
Current Challenges in Patent Information Retrieval, pp.299-324.
6. Bouadjenek, M, Sanner, S & Ferraro, G., 2015. A Study of Query Reformulation of Patent Prior Art Search with
Partial Patent Applications. 15th International Conference on Artificial Intelligence and Law (ICAIL 2015),
Association for Computing Machinery (ACM), USA, pp. 1-11.
7. Youngho Kim and W. Bruce Croft, 2014. Diversifying Query Suggestions based on Query Documents. In Proc. of
SIGIR'14.
8. Ferraro, G, Suominen, H & Nualart, J., 2014. Segmentation of patent claims for improving their readability. 3rd
Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR).Stroudsburg, PA
18360, USA, pp. 66-73.
9. Linda Andersson, Allan Hanbury, Andreas Rauber, 2017. The Portability of three type of Text Mining Techniques into
the patent text genre. In M. Lupu, K. Mayer, J. Tait, and A. J. Trippe, Second edition, Current Challenges in Patent
Information Retrieval.
10. Hyunseok Park, Janghyeok Yoon, Kwangsoo Kim. Identifying patent infringement using SAO based semantic
technological similarities. Scientometrics (2012) 90: 515. doi:10.1007/s11192-011-0522-7.
11. DU Yufeng, JI Duo, JIANG Lixue, et al. Patent Similarity Measure Based on SAO Structure[J]. Chinese Sentence
and Clause for Text Information Processing, 2016, 30(1): 30-36.
12. Sungchul Choi, Hyunseok Park, Dongwoo Kang, Jae Yeol Lee, Kwangsoo Kim. An SAO-based text mining
approach to building a technology tree for technology planning. Expert Systems with Applications 39 (2012) 11443–
11455
13. D. M. Korobkin, S. A. Fomenkov, S. G. Kolesnikov. A function-based patent analysis for support of technical
solutions synthesis. ICIEAM. – [Publisher: IEEE], 2016. – 4 p. – DOI: 10.1109/ICIEAM.2016.7911581.
14. Korobkin, Dmitriy M.; Fomenkov, Sergey A.; Kolesnikov, Sergey G.; Golovanchikov, Alexander B. Technical
Function Discovery in Patent Databases for Generating Innovative Solutions. IADIS International Journal on
Computer Science & Information Systems . 2016, Vol. 11 Issue 2, p241-245. 5p.
15. Blei, David M., Ng, Andrew Y., Jordan, Michael I., 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993-1022.
16. Dmitriy M. Korobkin, Sergey А. Fomenkov, Alla G. Kravets, Alexander B. Golovanchikov, 2016. Patent data
analysis system for information extraction tasks. 13th International Conference on Applied Computing (AC) 2016. pp.
215-219.
17. Salton, G.; Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Information Processing &
Management, 24 (5), pp. 513–523.
18. Alla G. Kravets, Dmitriy M. Korobkin, Mikhail A. Dykov. E-patent examiner: Two-steps approach for patents prior-
art retrieval. IISA 2015 Conference Proceeding / Ionian University, Institute of Electrical and Electronics Engineers
(IEEE) [Piscataway, USA]. – 2015. – DOI: 10.1109/IISA.2015.7388074.
19. Korobkin, D., Fomenkov, S., Kravets, A., Kolesnikov, S., Dykov, M., 2015. Three-Steps Methodology for Patents
Prior-Art Retrieval and Structured Physical Knowledge Extracting. Proceeding of CIT&DS 2015, pp. 124-136.
20. Toutanova, K., Manning, C.D., 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech
Tagger, In Proceeding EMNLP '00. Volume 13, Hong Kong, pp. 63-70.
21. Hall, Johan, 2006. MaltParser – An Architecture for Inductive Labeled Dependency Parsing, University of Colorado,
p.92
22. Patent US 6912111 B2 “Impulse lightning arresters and pulse arrester columns for power lines”.
23. Haverinen, Katri, Viljanen, Timo, Laippala, Veronika, Kohonen, Samuel, Ginter, Filip, Salakoski, Tapio, 2010.
Treebanking Finnish. In: Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories
(TLT).
24. Marie-Catherine de Marneffe and Christopher D. Manning, 2016: Stanford typed dependencies manual.
25. Mel’čuk, Igor A., 1988. Dependency Syntax: Theory and Practice. SUNY Publ., NY.
... For the task of extraction of the physical effects descriptions from patent texts are used the previously developed procedures [5] for segmentation of complex sentences of patent texts, morphological and semantic analysis with the construction of dependency trees, and building the deep-syntactic structures based on the Meaning-Text Theory [6] for reduced Stanford dependencies. Example of deep-syntactic structures from the sentence of patent US20130307109A1 "(1)When (2)light (3)enters (4)the (5)semiconductor (6)junction (7)of (8)such (9)an (10)element (11), (12) According to the PhE model [7] developed at the CAD Department of Volgograd State Technical University, input cause-action produces an output effect-action on the environment or object. For extraction of physical effect from text it is necessary to find the predicates (verbs) such as "change", "increase", "decrease", "dependence", "change", "generate", "act", "cause", etc. that show some "action" with arguments. ...
... The obtained components of the technical function representation according to the SAO model recognized in the patent sentences are listed in Table 3. The algorithm of grouping (comparison) SAOs [12] is used to combine several "Subject-Action-Object" structures into one common structure. ...
Conference Paper
Full-text available
Authors use the physical effects (PhE) for the synthesis of the physical operation principle of a technical system. PhEs realize the technical functions that compose the selected functional structure of the designed technical system. Authors developed the method for extraction the physical effects descriptions from the patents of USPTO and RosPatent databases, and the method for extracting of technical functions from Natural Language documents including patent texts. The method of automated construction of a matrix of physical functions performed by physical effects is based on the detection of latent dependencies in the consolidated matrix "Physical Effects-Technical Functions". Developed software has been tested for tasks of extraction of the physical effects and technical functions from patent documents.
... The paper developed and described a method for analyzing patent arrays [19] to obtain criteria-based assessments [20,21] of the innovative potential and prospects of the developed high-tech technical systems and technologies. There are four main criteria reflecting various aspects of the patented developmenteconomic, informational and criteria of mass character for the current and next years. ...
Chapter
Full-text available
In today’s rapidly developing technological world, new ideas, inventions and developments appear daily. At the same time, individual technologies may have common features, through the use of similar methods, modification and expansion of existing technologies, or solving a common problem for which developments are being created. Patenting is often used to preserve these ideas and protect intellectual property. During the development of the program for the analysis of the patent array to obtain criteria assessments of innovation potential and prospects, enshrined in patent high–tech technical systems and technologies, the subject area—patent array was investigated, methods for analyzing texts in natural language and various options for determining the criteria of innovation potential were considered. As a result, algorithms were developed to obtain the following criteria for assessing the innovative potential of a patent: the mass character of the subject of this technology for the current year and the estimated frequency of occurrence for the next, the economic characteristics of the patent holder’s company and the potential citation of the patent. These criteria are determined based on the analysis of texts and data of patents using clustering, classification, regression analysis and normalization of the name of the patent holder. The developed algorithms were tested on patents issued by the US Patent and Trademark Office, as well as on Google Patents.KeywordsCyber-physical systemsPatentsFact extraction
... The paper developed and described a method for analyzing patent arrays [19] to obtain criteria-based assessments [20,21] of the innovative potential and prospects of the developed high-tech technical systems and technologies. There are four main criteria reflecting various aspects of the patented developmenteconomic, informational and criteria of mass character for the current and next years. ...
... The authors of [3,4] suggest automating the major, initial stages of designing new technical systems and technologies based on updated knowledge bases obtained from the world patent array [5,6], including the RosPatent patent database [7]. ...
Article
Full-text available
The task of automating the synthesis of innovative solutions in the field of technical systems and technologies is one of the highest-priority problems of science. The authors propose to automate the most important, initial stages of the design of new technical systems and technologies based on updated knowledge bases obtained from the world patent database, including the RosPatent patent database. According to the method of morphological analysis and synthesis, the main structural features (functions of technical objects) are extracted from a technical solution (patent). These features are collected into a morphological matrix and combined, which yields many new solutions. The paper describes the development of software for extracting descriptions of technical functions from Russian patents. A grammar for representing descriptions of technical functions according to the “Action-Object-Condition” model in Russian-language patents was formed, and algorithms were developed for the initial processing of the patent database, the extraction of technical functions through the analysis of dependency trees, and the formation of the morphological matrix. The software, consisting of a patent-database processing module, a patent-claim text segmentation module, a semantic text analysis module, a module for extracting descriptions of technical functions, and a module for presenting the results of patent-database processing, was tested on practical problems.
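The morphological analysis and synthesis described above, where extracted features are collected into a matrix and combined into new solutions, can be sketched minimally as follows. The function names and alternatives in the matrix are illustrative assumptions, not data from the paper.

```python
from itertools import product

# Hypothetical morphological matrix: each row lists alternative
# implementations of one technical function extracted from patents.
morphological_matrix = [
    ["electric drive", "hydraulic drive", "pneumatic drive"],  # actuation
    ["gear train", "belt transmission"],                       # transmission
    ["optical sensor", "inductive sensor"],                    # feedback
]

def synthesize_solutions(matrix):
    """Combine one alternative per function into candidate solutions."""
    return [combo for combo in product(*matrix)]

solutions = synthesize_solutions(morphological_matrix)
print(len(solutions))  # 3 * 2 * 2 = 12 candidate technical solutions
```

The Cartesian product enumerates every combination of one alternative per row, which is exactly how the morphological matrix "gives a lot of new solutions".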
... The segmentation algorithm [6] consists of the following transformations using regular expressions. To remove the numbering like «A. », «a. ...
Chapter
Full-text available
Authors use physical effects (PE) to synthesize the physical operation principle of a technical system. PEs implement the technical functions (TF) that describe the functional structure of the declared technical system. The method finds relationships between physical effects and the technical functions they perform, based on the construction of term-document matrices and the search for hidden dependencies in them. To this end, the authors developed a method for extracting descriptions of physical effects from patents in the USPTO and RosPatent databases, as well as a method for extracting technical functions from the natural-language texts of the same documents. The developed software has been tested on the tasks of extracting physical effects and technical functions from patent documents.
... The obtained components of the technical function representation according to the SAO model recognized in the patent sentences are listed in Table 1. The algorithm of grouping (comparing) SAOs [16] is used to combine several "Subject-Action-Object" structures into one common structure. ...
Article
Full-text available
For the synthesis of the physical operation principle of a technical system, the authors used the physical effects realizing the technical functions that compose the selected functional structure of the designed technical system. The authors developed a method for extracting descriptions of physical effects from the patents of the USPTO and RosPatent databases, and a method for extracting technical functions from natural-language documents, including patent texts. The method of automated construction of a matrix of physical functions performed by physical effects is based on the detection of latent dependencies in the consolidated “Physical Effects – Technical Functions” matrix. The developed software has been tested on the tasks of extracting physical effects and technical functions from patent documents.
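A minimal sketch of detecting latent dependencies in a consolidated “Physical Effects – Technical Functions” matrix. The paper does not specify the exact factorization, so truncated SVD is used here as a plausible stand-in, and the matrix values are invented for illustration.

```python
import numpy as np

# Toy "Physical Effects x Technical Functions" co-occurrence matrix
# (counts of joint mentions in patents); values are illustrative.
pe_tf = np.array([
    [3., 0., 1., 0.],
    [0., 2., 0., 1.],
    [2., 0., 3., 0.],
])

# A rank-2 truncated SVD exposes latent PE-TF dependencies: an effect
# and a function connected through shared latent factors can get a
# non-zero score in the reconstruction even if their raw count was zero.
U, s, Vt = np.linalg.svd(pe_tf, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Ranking the entries of `approx` would then suggest which physical effects are likely to perform which technical functions beyond the explicitly observed pairs.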
Chapter
The paper presents a text mining approach to identifying technological trajectories. The main problem addressed is the selection of documents related to a particular technology; these documents are needed to identify the technology’s trajectory. Two different methods were compared (one based on word2vec, the other on lexical-morphological and syntactic search). The aim of the developed approach is to retrieve more information about a given technology and about technologies that could affect its development. We present the results of experiments on a dataset containing over 4.4 million documents from the USPTO patent database. Self-driving car technology was chosen as an example. The results show that the developed methods are useful for automated information retrieval as the first stage of the analysis and identification of technological trajectories.
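The word2vec-based document selection could look roughly like the sketch below: documents are represented as averaged word vectors and ranked by cosine similarity to the technology query. The toy embeddings, vocabulary, and document contents are assumptions standing in for a trained model and the USPTO corpus.

```python
import numpy as np

# Toy word embeddings standing in for a trained word2vec model.
emb = {
    "self-driving": np.array([0.9, 0.1, 0.0]),
    "car":          np.array([0.8, 0.2, 0.1]),
    "vehicle":      np.array([0.7, 0.3, 0.1]),
    "autonomous":   np.array([0.9, 0.2, 0.0]),
    "fuel":         np.array([0.1, 0.9, 0.2]),
    "cell":         np.array([0.0, 0.8, 0.4]),
}

def doc_vector(tokens):
    """Represent a document as the mean of its known word vectors."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = doc_vector(["self-driving", "car"])
docs = {
    "patent_A": ["autonomous", "vehicle"],
    "patent_B": ["fuel", "cell"],
}
ranked = sorted(docs, key=lambda d: cosine(query, doc_vector(docs[d])),
                reverse=True)
print(ranked)  # patent_A should rank above patent_B for this query
```

Documents above a similarity threshold would then be kept as the technology-related subset from which the trajectory is built.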
Chapter
Analysis and modeling of the cross-thematic states of the world prior art is a voluminous task that includes many subtasks. In order to assess the prior art, build forecasts, and carry out analysis, it is necessary to construct cross-thematic relationships between patents within an array in many ways. The scientific result of the work is the first formal metric, “belonging to the technological epoch,” for assessing the cross-thematic states of the world prior art, together with the technique and method of applying formal metrics. This paper presents the development of a software module based on the developed metric.
Article
Full-text available
The use of the global patent space to determine the scientific and technological priorities for technical systems development (identifying patent trends) allows one to forecast the direction of technical systems development and, accordingly, to select patents of priority technical subjects as a source for updating the technical functions database and physical effects database. The authors propose an original method that uses as trend terms not individual unigrams or n-grams (as is usual for existing methods and systems), but structured descriptions of technical functions in the form “Subject-Action-Object” (SAO), which in the authors’ opinion are the basis of the invention.
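Using SAO structures as trend terms amounts, at its simplest, to tracking how often a given triple occurs per year; a rising series marks it as a candidate trend. The SAO triple and yearly counts below are invented for the sketch.

```python
from collections import Counter

# Illustrative SAO occurrences per year across a patent corpus.
yearly_saos = {
    2016: [("system", "detects", "obstacle")] * 3,
    2017: [("system", "detects", "obstacle")] * 8,
    2018: [("system", "detects", "obstacle")] * 15,
}

def sao_trend(term, saos_by_year):
    """Per-year frequency of one SAO trend term; a monotonically
    rising series flags the term as a candidate patent trend."""
    return {year: Counter(saos)[term]
            for year, saos in sorted(saos_by_year.items())}

trend = sao_trend(("system", "detects", "obstacle"), yearly_saos)
print(trend)  # {2016: 3, 2017: 8, 2018: 15}
```

A real pipeline would extract the triples from patent texts first; here they are given directly.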
Conference Paper
Full-text available
The paper describes a method designed to extract structured information from patent data to support solving the problem of synthesizing new technical solutions. Ways of implementing technical functions extracted from patents are represented by the “object-condition-action” model. An extraneous syntactic parser is used to assign syntactic categories to words from the patent claims section. A context-sensitive grammar was developed to bring the words of the parsed text into compliance with the “object-condition-action” model components.
Chapter
Full-text available
In this paper we proposed a three-step methodology for processing natural-language text from the global patent space, covering patent prior-art retrieval and fact extraction (descriptions of physical effects); it consists of a preprocessing step, statistical analysis, and semantic analysis. In the statistical analysis step we developed a method for computing patent document topic distributions using multiple pre-trained LDA models. In the semantic analysis step we applied the developed method for decomposing a complex sentence into several simple dependency trees, the method for simplifying the semantic tree on the basis of deep syntactic relations, and the method for calculating semantic similarity. Applying these methods allowed us to increase the accuracy of application novelty calculation, and also to extract structured physical knowledge in the form of physical effects from preprocessed patent texts. The developed methods were adapted to handle large amounts of text data.
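One way topic distributions from multiple pre-trained LDA models might be compared between an application and a patent is sketched below. Averaging per-model Hellinger distances is an assumption of this sketch, not the paper's stated combination rule, and the distributions are invented.

```python
import numpy as np

# Illustrative topic distributions for one application and one patent,
# as produced by two hypothetical pre-trained LDA models of different
# sizes (a 3-topic and a 2-topic model).
app_topics    = {"lda_a": np.array([0.7, 0.2, 0.1]),
                 "lda_b": np.array([0.5, 0.5])}
patent_topics = {"lda_a": np.array([0.6, 0.3, 0.1]),
                 "lda_b": np.array([0.4, 0.6])}

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0..1)."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Combine evidence from all models by averaging per-model distances;
# a small distance suggests the patent is a prior-art candidate.
dist = np.mean([hellinger(app_topics[m], patent_topics[m])
                for m in app_topics])
```

Patents with the smallest combined distance to the application would be passed on to the semantic (SAO-based) comparison step.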
Chapter
In the past decade research into automated patent classification has mainly focused on the higher levels of International Patent Classification (IPC) hierarchy. The patent community has expressed a need for more precise classification to better aid current pre-classification and retrieval efforts (Benzineb and Guyot, Current challenges in patent information retrieval. Springer, New York, pp 239–261, 2011). In this chapter we investigate the three main difficulties associated with automated classification on the lowest level in the IPC, i.e. subgroup level. In an effort to improve classification accuracy on this level, we (1) compare flat classification with a two-step hierarchical system which models the IPC hierarchy and (2) examine the impact of combining unigrams with PoS-filtered skipgrams on both the subclass and subgroup levels. We present experiments on English patent abstracts from the well-known WIPO-alpha benchmark data set, as well as from the more realistic CLEF-IP 2010 data set. We find that the flat and hierarchical classification approaches achieve similar performance on a small data set but that the latter is much more feasible under real-life conditions. Additionally, we find that combining unigram and skipgram features leads to similar and highly significant improvements in classification performance (over unigram-only features) on both the subclass and subgroup levels, but only if sufficient training data is available.
Chapter
In this book chapter, we examined the portability of several different well-known text mining techniques on patent text. We test the techniques by addressing three different relation extraction applications: acronym extraction, hyponymy extraction and factoid entity relation extraction. These applications require different types of natural language processing tools, from simple regular expression matching (acronym extraction), to part of speech and phrase chunking (hyponymy extraction), to a full-blown dependency parser (factoid extraction). With the relation extraction applications presented in this chapter, we want to elucidate the requirements needed of general natural language processing tools when deployed on patent text for a specific extraction task. On the other hand, we also present language technology methods which are already portable to the patent genre with no or only moderate adaptations to the text genre.
Article
Many domain-specific search tasks are initiated by document-length queries, e.g., patent invalidity search aims to find prior art related to a new (query) patent. We call this type of search Query Document Search. In this type of search, the initial query document is typically long and contains diverse aspects (or sub-topics). Users tend to issue many queries based on the initial document to retrieve relevant documents. To help users in this situation, we propose a method to suggest diverse queries that can cover multiple aspects of the query document. We first identify multiple query aspects and then provide diverse query suggestions that are effective for retrieving relevant documents as well being related to more query aspects. In the experiments, we demonstrate that our approach is effective in comparison to previous query suggestion methods.
Article
Query reformulation modifies the original query with the aim of better matching the vocabulary of the relevant documents, and consequently improving ranking effectiveness. Previous models typically generate words and phrases related to the original query, but do not consider how these words and phrases would fit together in actual queries. In this article, a novel framework is proposed that models reformulation as a distribution of actual queries, where each query is a variation of the original query. This approach considers an actual query as the basic unit and thus captures important query-level dependencies between words and phrases. An implementation of this framework that only uses publicly available resources is proposed, which makes fair comparisons with other methods using TREC collections possible. Specifically, this implementation consists of a query generation step that analyzes the passages containing query words to generate reformulated queries and a probability estimation step that learns a distribution for reformulated queries by optimizing the retrieval performance. Experiments on TREC collections show that the proposed model can significantly outperform previous reformulation models.
Article
A technology tree (TechTree) is a branching diagram that expresses relationships among product components, technologies, or functions of a technology in a specific technology area. A TechTree identifies strategic core technologies and is a useful tool to support decision making in a given market environment for organizations with specified capabilities. However, existing TechTrees generally overemphasize qualitative and expert-dependent knowledge rather than incorporating quantitative and objective information. In addition, the traditional process of developing a TechTree requires vast amounts of information, which costs considerably in terms of time, and cannot provide integrated information from a variety of technological perspectives simultaneously. To remedy these problems, this research presents a text mining approach based on Subject–Action–Object (SAO) structures; this approach develops a TechTree by extracting and analyzing SAO structures from patent documents. The extracted SAO structures are categorized by similarities, and are identified by the type of technological implications. To demonstrate the feasibility of the proposed approach, we developed a TechTree regarding Proton Exchange Fuel Cell technology.
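The categorization of extracted SAO structures by similarity can be sketched at its simplest by grouping triples that share the same Action-Object pair, i.e. that describe the same technological function. The triples below are invented for illustration, not drawn from the paper's fuel-cell data.

```python
from collections import defaultdict

# Illustrative SAO triples as might be extracted from patent claims.
saos = [
    ("membrane", "conducts", "protons"),
    ("electrolyte", "conducts", "protons"),
    ("catalyst", "accelerates", "reaction"),
]

def group_by_function(triples):
    """Group SAO structures sharing an Action-Object pair: their
    subjects are alternative components performing one function."""
    groups = defaultdict(list)
    for subject, action, obj in triples:
        groups[(action, obj)].append(subject)
    return dict(groups)

print(group_by_function(saos))
# {('conducts', 'protons'): ['membrane', 'electrolyte'],
#  ('accelerates', 'reaction'): ['catalyst']}
```

A full TechTree approach would use softer semantic similarity between SAOs rather than exact matching; exact Action-Object keys keep the sketch minimal.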