One-Class Support Vector Machine and LDA Topic Model Integration -
Evidence for AI Patents
Anton Thielmann ∗
Christoph Weisser†
Astrid Krenz∗∗
August 2020
Abstract. The present contribution suggests a two-step classification rule for unsupervised document classification, using one-class Support Vector Machines and Latent Dirichlet Allocation Topic Modeling. The integration of both algorithms allows the use of labelled, but independent, training data that does not stem from the data set to be classified. Manual labelling when trying to extract a specific class from an unlabelled data set can thus be circumvented. By choosing appropriate document representations and parameters in the one-class Support Vector Machine, the differences between the independent training class and the data set to be classified become negligible. The method is applied to a large data set on patents for the European Union.
Keywords: Artificial Intelligence, Support Vector Machines, Topic Modeling.
JEL: C50.
∗University of Göttingen, Centre for Statistics, Humboldtallee 3, 37073 Göttingen, Germany; email: anton.thielmann@stud.uni-goettingen.de.
†University of Göttingen, Centre for Statistics, Humboldtallee 3, 37073 Göttingen, Germany; email: c.weisser@stud.uni-goettingen.de.
∗∗University of Sussex, Digital Futures at Work (Digit) Research Centre, Jubilee Building, Brighton, UK; email: a.m.krenz@sussex.ac.uk.
1. Introduction
A common problem in document classification is the sheer volume of often unlabelled data. Unsupervised document classification algorithms (see for example [2], [25], [14]) are often used only to gain a broad overview of the topics in such large data sets. A more thorough classification of documents typically involves manually creating a labelled training data set. The present paper circumvents the manual labelling by making use of one-class Support Vector Machines (SVM) [22] and Latent Dirichlet Allocation (LDA) topic modeling [2], applied to a large data set on patent applications from the European Patent Office (EPO).
Patent classification is an important task in assessing technological progress as well as defining
innovative capability [6]. The international patent corpus is an enormous source which already
contains more than 100 million patents. The patent data comes with a set of industrial clas-
sification codes, the so-called Cooperative Patent Classification (CPC). For the official patent
data, the classification is mostly done manually, and there exists by now a set of approximately
250,000 predefined classification entries [1]. A problem arises for new patent applications from
entirely new technology areas: there are simply no CPC codes available, and as such these
patents cannot easily be assigned to a classification code. This applies to patents in the area
of artificial intelligence (AI), for example. The classification problem becomes more severe as the number of patents in the field of AI increases [29].
The primary goal of this paper is to circumvent some of the problems that come along with unsupervised document classification and to classify patents that cover the topic of AI. This is mainly achieved by taking advantage of the similarity between scientific papers and patent documents. The applied methods are arranged in an innovative order that allows topic models to contribute to classification problems beyond a merely descriptive function, resulting in a two-step classification rule. The approach can easily be transferred to other unsupervised document classification problems.
2. Related Literature
This paper especially draws from the literature on natural language processing (NLP) and
patent classification. The approaches to patent classification are manifold and range from simple
keyword search [35] and subsequent classification to the application of neural networks [15, 9].
However, most of these classification approaches either make use of an already labelled data
set [28, 15, 9] or have experts scan the data set and label it manually [32]. When it comes to
unsupervised patent classification, the literature is more focused on either keyword approaches
or topic modeling [26].
Other methods used in unsupervised document classification, such as k-means or sequential clustering [4, 25], are mostly employed within hierarchical classification algorithms [7, 12], where unsupervised methods are combined with SVM or k-Nearest Neighbour (kNN) algorithms to obtain accurate classification results.
The present classification thus applies two methods: a one-class SVM trained on a web-scraped data set, and LDA topic modeling.
To the best of our knowledge, one-class document classification has so far not been used in
patent classification. This could be due to the complexity of the data, the large number of
subclasses (250,000 CPC codes, and 70,000 International Patent Classification (IPC) codes) and
the possibility of multiple labelling. However, similar to the present approach, Fujino and Isozaki
[10] made use of the similarity between scientific papers and patent abstracts, labelling scientific
papers with IPC codes.
The successful application of one-class SVMs in document classification has already been demonstrated by Manevitz and Yousef [17], and we use them in a very similar way.
The present approach builds on both of these methods and combines them with LDA topic modeling [2].
3. Data
The data set at hand is the so-called global PATSTAT data set from the EPO, which contains
bibliographical and legal event patent data from leading industrialised and developing countries.
It comprises more than 100 million patents extracted from the EPO's various databases. In order to obtain a more suitably sized data set, the complete data set was filtered in the following way: only patents of type 'A' (the pure patents) from 1980 to today are included, and all patents for which no address of any patent holder is known, which do not originate from an EU28 country, or which are not written in English are excluded. This guarantees that we are working with data for the European Union.
From the roughly 540,000 remaining patents, all abstracts are taken into account, preprocessed and analyzed. The preprocessing of the text data follows common text preprocessing practice in NLP [30, 33]. All words are lowercased and tokenized, all numbers and symbols are removed, and stopword removal [24] is applied using spaCy's built-in dictionary [11], extended by patent-specific words such as "method" or "patent". The remaining words are lemmatized, resulting in patent abstracts of 54 words on average.
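For illustration, a minimal sketch of such a preprocessing pipeline might look as follows. It assumes spaCy's small English model is installed (python -m spacy download en_core_web_sm); the two extra stopwords follow the text, while the authors' full patent-specific stopword list is not reproduced here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
stop_words = nlp.Defaults.stop_words | {"method", "patent"}  # patent-specific additions

def preprocess(abstract: str) -> list[str]:
    """Lowercase, tokenize, drop numbers/symbols and stopwords, lemmatize."""
    doc = nlp(abstract.lower())
    return [
        tok.lemma_
        for tok in doc
        if tok.is_alpha                   # removes numbers and symbols
        and tok.lemma_ not in stop_words  # stopword removal on the lemma
    ]

print(preprocess("A method for training artificial neural networks on 2 GPUs."))
```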
4. Method
In this section, the one-class SVM for document classification, using term frequency-inverse document frequency (tf-idf) vectors as input, and the LDA topic model are described, as well as how the training data is generated from abstracts of scientific papers. The general idea is to create a classification rule that classifies a majority of all positive documents as positive, while keeping false positive classifications negligible. Subsequently, the falsely classified positive documents are identified using LDA topic models. Before going into detail on each step, the algorithm, as shown in Figure 1, is summarized: First, the training data is generated by simple web scraping; exploiting the similarity between scientific paper abstracts and patent abstracts, scientific papers covering the relevant topic are scraped. Second, a one-class SVM is trained on the single-class training data, which consists of the scraped scientific papers. Third, the model predictions on the original data set are obtained. Subsequently, the patents classified as positive are analyzed with the help of LDA topic models, and the relevant topics are identified with the help of visual representations [23, 5]. Depending on the perceived quality of the one-class SVM classification and the LDA topic models, the patents related to the relevant topic are identified.
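As a compact illustration of this two-step procedure, the following sketch chains the steps with scikit-learn on toy stand-ins for the scraped paper abstracts and the patent corpus; the corpora are invented and, although the reported settings (rbf kernel, ν = 0.005) are used, the predictions on toy data carry no meaning.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import OneClassSVM

# Toy stand-ins: the real inputs are the 1,008 scraped paper abstracts
# and the roughly 540,000 preprocessed patent abstracts.
train_papers = [
    "artificial intelligence neural network learning algorithm",
    "machine learning classification with deep neural networks",
    "intelligent agent reinforcement learning and planning",
]
patents = [
    "neural network learning algorithm for adaptive signal classification",
    "hydraulic valve assembly for combustion engines",
]

# Step 1: train a one-class SVM on the positive (scraped) class only.
vectorizer = TfidfVectorizer(norm="l2")
X_train = vectorizer.fit_transform(train_papers)
ocsvm = OneClassSVM(kernel="rbf", nu=0.005).fit(X_train)

# Step 2: predict on the patent corpus; +1 marks a candidate AI patent.
X_patents = vectorizer.transform(patents)
candidates = [p for p, y in zip(patents, ocsvm.predict(X_patents)) if y == 1]

# Step 3: run LDA on the candidates only and inspect the topics.
if candidates:
    counts = CountVectorizer().fit_transform(candidates)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    print(lda.transform(counts))  # per-document topic distributions
```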
4.1. One-Class Support Vector Machines. The obvious advantage of one-class classification algorithms is that only a single class of training data needs to be provided. Schölkopf et al. [22] introduced such an algorithm, which extends the classical SVM [31, 3] and incorporates only a single class in training.
single class in training. Considering x1, ...x`∈Xas training data belonging to a single class and
`∈Nbeing the number of observations, the classical SVM optimization problem is extended
only slightly, to obtain a decision function f(x) which is positive on S (with S as a subset of the
feature space H) and negative on the complement ¯
S, so that
f(x) =
+1 if x ∈S
−1 if x ∈¯
S
(1)
Figure 1. Prediction Procedure
Informally, the probability that a test point drawn from the data set's probability distribution lies outside of $S$ is bounded by some a priori specified value $\nu \in (0, 1]$ [17, 22]. More formally, the optimization problem becomes:

$$\min_{w \in H,\ \xi \in \mathbb{R}_+^\ell,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\ell\nu}\sum_{i=1}^{\ell}\xi_i - \rho \qquad \text{subject to: } (w \cdot \Phi(x_i)) \ge \rho - \xi_i, \quad i = 1, 2, \dots, \ell, \quad \xi_i \ge 0 \tag{2}$$

where $\Phi$ is a kernel map $X \to H$, such that inner products in the image of $\Phi$ can be computed by evaluating a simple kernel function, $w$ is the normal vector to the separating hyperplane, $\rho$ is an offset parameterizing the hyperplane in the feature space, and the $\xi_i$ are nonnegative slack variables.
The tradeoff between the decision function

$$f(x) = \operatorname{sgn}\bigl((w \cdot \Phi(x)) - \rho\bigr) \tag{3}$$

being positive for most of the training data $x_1, \dots, x_\ell \in X$ and $\|w\|$ being small is controlled by $\nu$, which satisfies two bounds: it is an upper bound on the fraction of outliers [22] and a lower bound on the fraction of support vectors relative to the total number of training observations. In the present case, as we are using training data not originating from the original data set, finding the optimal $\nu$ is of crucial importance. The idea is to set $\nu$ very low and thus "create" a larger subset $S$ of the feature space $H$, in order to avoid overfitting and to obtain a classification rule that is applicable to a more diverse data set, thereby integrating a training class that does not stem from the data set to be classified.
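The two bounds can be checked empirically; the following sketch, on arbitrary synthetic Gaussian data rather than the patent corpus, shows how the fraction of training outliers stays (approximately) below $\nu$ while the fraction of support vectors stays above it:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # synthetic stand-in for tf-idf vectors

for nu in (0.005, 0.1, 0.5):
    clf = OneClassSVM(kernel="rbf", nu=nu).fit(X)
    outliers = np.mean(clf.predict(X) == -1)      # fraction of training outliers
    sv_frac = len(clf.support_vectors_) / len(X)  # fraction of support vectors
    print(f"nu={nu:5.3f}  outliers={outliers:.3f}  support vectors={sv_frac:.3f}")
```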
To use the one-class SVM for document classification, the documents need to be represented in a suitable numerical form. Similar to Manevitz and Yousef [17], we used more than one document representation, namely a binary representation and a tf-idf representation [21], with

$$\text{tf-idf}(word) = frequency(word) \cdot \left[\log\frac{k}{K(word)} + 1\right] \tag{4}$$

where $k$ is the total number of documents in the corpus and $K(word)$ is the number of documents in which the word appears.
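Equation (4) corresponds, with natural logarithms, to Scikit-learn's TfidfVectorizer with smooth_idf=False; a minimal check on a three-document toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural network", "neural control", "hydraulic valve"]
vec = TfidfVectorizer(norm=None, smooth_idf=False)  # idf = ln(k / K(word)) + 1
X = vec.fit_transform(docs)

i = vec.vocabulary_["neural"]      # "neural" appears in K = 2 of k = 3 documents
manual = 1 * (np.log(3 / 2) + 1)   # frequency * [log(k / K(word)) + 1], eq. (4)
assert np.isclose(X[0, i], manual)
```

With norm="l2" instead, the setting used in Section 5, each document vector is additionally scaled to unit length, accounting for different document lengths.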
4.2. Topic Modeling. To ensure the integration of the unrelated training data and thus justify
the first step of the classification, we make use of LDA topic models [2]. LDA topic modeling is
an unsupervised machine learning technique that detects word and phrase patterns, defined as
topics, in a set of documents. The general idea is that topics are characterized by a distribution
over words, no matter the positional occurrence of words. Thus, as defined by Blei et al. [2],
documents are represented as random mixtures over these latent topics. Essentially, we look for the joint posterior probability of a topic distribution for each document, a topic assignment for each of its words, and a word distribution for each topic, given the corpus of all documents. Formally, it can be stated as:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \tag{5}$$

with $\mathbf{w}$ being a document, denoted as a sequence of $N$ words $\mathbf{w} = (w_1, \dots, w_N)$, from the corpus $D = \{\mathbf{w}_1, \dots, \mathbf{w}_M\}$ of $M$ documents, and $\theta$ representing the probability that the $i$-th document contains the $j$-th topic. Thus, the joint probability is equal to:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \tag{6}$$

with $z_n$, $n = 1, \dots, N$, being the topic assignments of the $N$ words of a document, and $\alpha$ and $\beta$ representing the parameters of the underlying Dirichlet distributions.
Topic models could, of course, have been applied to the original data set right from the beginning in order to get a general understanding of the data and the topics the data set covers. However, by applying the topic models only to those documents classified as positive by the classifier, as our approach does, the resulting topics become much easier to interpret.
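As an illustration of this second step, the following sketch fits an LDA model to a toy list of positively classified abstracts using scikit-learn (the paper does not prescribe a particular LDA implementation), with two topics instead of the paper's ten; it recovers the two quantities used below, the per-document topic distribution and the top words of each topic:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

candidates = [  # toy stand-ins for the positively classified patent abstracts
    "neural network training for speech recognition",
    "natural language processing of patent text",
    "neural network image classification system",
]
vec = CountVectorizer()
counts = vec.fit_transform(candidates)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
theta = lda.transform(counts)  # per-document topic distribution
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = words[topic.argsort()[-3:][::-1]]  # three highest-weight words
    print(f"topic {k}:", ", ".join(top))
```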
4.3. Training class. Since the patent data is not labelled in such a way that it is possible
to create a training class that adequately represents AI, a different, more complex approach is
required. To generate a valid data set as training data without requiring experts to scan and
read a large part of the 540,000 patent abstracts, the similarity between scientific papers and
patents is used, in a similar manner as done by Fujino and Isozaki [10]. Both document types
have in common that a short summary of the full text, the abstract, is given at the beginning
of the document. Scientific papers, however, often come with so-called keywords, which implicitly assign topics to the paper. Accordingly, labelled training data was scraped from IEEE Xplore [13]: 1,008 papers carrying the keyword "artificial intelligence" were scraped as the training class. One should keep in mind that other relevant keywords, such as "machine learning", could also be web-scraped, and different websites could be tried as sources of training data, which is something that could be done in future analyses. For our paper, however, we focused on scraping the main AI keyword.
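A heavily hedged sketch of such a keyword-based scraping step is given below. The URL, query parameter, and CSS selector are purely hypothetical placeholders, not IEEE Xplore's actual markup or API, and any real scraping should respect the site's terms of use.

```python
import requests
from bs4 import BeautifulSoup

def scrape_abstracts(keyword: str, search_url: str) -> list[str]:
    """Fetch one search-results page and collect abstract texts."""
    resp = requests.get(search_url, params={"q": keyword}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # ".abstract" is a placeholder selector, not IEEE Xplore's real markup.
    return [node.get_text(strip=True) for node in soup.select(".abstract")]

# Hypothetical usage; the real URL and pagination are not reproduced here:
# training_class = scrape_abstracts("artificial intelligence", "https://example.org/search")
```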
In order to confirm our expectation that the web-scraping of scientific AI papers was successful,
LDA topic models were performed on the corpus of the scientific AI papers. We kept the models
simple and fixed the number of topics to 10, not using bigram or trigram models [34], but the
simpler unigram LDA model [2]. To evaluate and interpret the model, we looked at LDAvis
plots [23]. The results are very satisfying, with some of the most salient terms being ”artificial”,
”network”, ”intelligence” or ”algorithm”. This can be further confirmed when looking at a
wordcloud corresponding to one randomly selected topic generated by the LDA topic model (see
Figure 2).
Figure 2. Wordcloud LDA topic model - web-scraping for scientific AI papers
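The following sketch shows how such diagnostics can be produced with the pyLDAvis and wordcloud packages on a small gensim model; the paper does not specify its exact tooling, so the toy corpus and package choices here are assumptions.

```python
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from wordcloud import WordCloud

# Toy corpus standing in for the scraped AI paper abstracts (pre-tokenized).
texts = [["neural", "network", "learning", "algorithm"],
         ["language", "processing", "text", "translation"],
         ["neural", "network", "intelligence", "artificial"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Interactive LDAvis view of the topics and their most salient terms.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")

# Wordcloud for one topic, built from its word-probability distribution.
freqs = dict(lda.show_topic(0, topn=50))  # {word: probability}
WordCloud(background_color="white").generate_from_frequencies(freqs).to_file("topic0.png")
```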
5. Results
Similar to Manevitz and Yousef [17], the one-class SVM yielded the best results when using the radial basis function (rbf) as kernel function. However, unlike in Manevitz and Yousef [17], the best results in terms of document representation were achieved with the tf-idf representation, using Scikit-learn's [20] built-in vectorizer with L2 normalization, which accounts for different document lengths.
The one-class SVM, using the rbf kernel and tf-idf document representations, classified 1,272 patents as AI patents. Relative to the roughly 540,000 original patents, this means that about 0.24 percent of all patents are classified as AI patents. We set $\nu$ very low, at 0.005, as we found that the web-scraping resulted in a very accurate training class which does not include many outliers. Setting $\nu$ to 0.5, for example, yielded, as expected, significantly worse classification results. Furthermore, the main objective at the first classification level is to avoid relevant patents being classified as non-AI. The results of the topic modeling, again using simple unigram models with a fixed number of 10 topics, reveal that the classified patents bear a strong resemblance to the scraped scientific AI papers, but do not include words such as "artificial" or "intelligence" (see Figure 3).
Figure 3. Wordclouds for topics for the classified AI patents
Using the LDAvis representation [23] to conduct further analyses on AI sub-categories and heterogeneity, the topic models revealed two topics that clearly describe sub-fields of AI, namely NLP and neural networks. They include the prominent terms "neural" and "network" as well as "language" and "process" (see Figure 4).
Figure 4. Wordclouds - LDAvis representation
Depending on the identified topics, as well as the underlying topic distributions, the final classification rule for the subset and heterogeneity analyses depends on the researcher and the research question. Choosing a very conservative approach and selecting only the patents whose most dominant topic is one of the two identified sub-fields of AI, we were able to identify 126 patents that make use of neural networks and 80 patents that make use of NLP. If, for example, we instead select those patents that have a prevalence of at least 15% for the "neural network" topic, we find 207 patents covering the topic of neural networks.
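Both selection rules reduce to simple thresholding on the document-topic matrix $\theta$; a minimal sketch with a toy matrix (the real $\theta$ would come from the fitted LDA model, as in the earlier snippet):

```python
import numpy as np

# Toy document-topic matrix theta (rows: patents, columns: topics);
# topic 0 is assumed to be the identified "neural network" topic.
theta = np.array([[0.80, 0.20],
                  [0.10, 0.90],
                  [0.18, 0.82]])
nn_topic = 0

dominant = np.argmax(theta, axis=1) == nn_topic  # conservative rule: dominant topic
prevalent = theta[:, nn_topic] >= 0.15           # looser rule: >= 15% prevalence
print(dominant.sum(), "patents by the dominant-topic rule,",
      prevalent.sum(), "by the 15% prevalence rule")
```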
6. Conclusion
The implementation of a one-class SVM and LDA topic modeling on an independent data set on European patents yields very good classification results. This methodological approach circumvents a core problem of unsupervised document classification, the identification and classification of sparsely represented classes, without the need for manual labelling. The presented method is especially useful when a reliable extraction of a negative training class is not possible. Further work could use the classification results from the LDA topic model to train a different, more powerful classifier on the original data, such as Naive Bayes [27, 18], Random Forest [36], Artificial Neural Networks [19], or classical SVMs [8, 16], which include a negative class during training. The focus of the approach would then shift towards generating a suitable training data set and subsequently using standard classification algorithms to classify the documents.
References
[1] EPO (2020), https://www.epo.org/index.html.
[2] Blei, David M and Ng, Andrew Y and Jordan, Michael I (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, Jan, 993-1022.
[3] Boser, Bernhard E and Guyon, Isabelle M and Vapnik, Vladimir N (1992). A training
algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on Com-
putational learning theory, 144-152.
[4] Buchta, Christian and Kober, Martin and Feinerer, Ingo and Hornik, Kurt (2012). Spherical
k-means clustering. Journal of Statistical Software, 50, 10, 1-22.
[5] Chaney, Allison June-Barlow and Blei, David M (2012). Visualizing topic models. Sixth
international AAAI conference on weblogs and social media.
[6] Chen, Yunwei and Yang, Zhiping and Shu, Fang and Hu, Zhengyin and Meyer, Martin and
Bhattacharya, Sujit (2009). A patent based evaluation of technological innovation capability
in eight economic regions in PR China, World Patent Information, 31, 2, 104-110.
[7] Chen, Yen-Liang and Chang, Yuan-Che (2012). A three-phase method for patent classifica-
tion. Information Processing & Management, 48, 6, 1017-1030.
[8] Colas, Fabrice and Brazdil, Pavel (2006). Comparison of SVM and some older classification
algorithms in text classification tasks, IFIP International Conference on Artificial Intelligence
in Theory and Practice, Springer Verlag, 169-178.
[9] Fall, Caspar J and Törcsvári, Atilla and Benzineb, Karim and Karetka, Gabor (2003). Automated categorization in the international patent classification. ACM SIGIR Forum, 37, 1, 10-25. ACM, New York, NY, USA.
[10] Fujino, Akinori and Isozaki, Hideki (2008). Multi-label Classification using Logistic Regres-
sion Models for NTCIR-7 Patent Mining Task. NTCIR.
[11] Honnibal, Matthew and Montani, Ines (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7, 1.
[12] Hu, Jie and Li, Shaobo and Yao, Yong and Yu, Liya and Yang, Guanci and Hu, Jianjun
(2018). Patent keyword extraction algorithm based on distributed representation for patent
classification. Entropy, 20, 2, 104.
[13] IEEE Xplore (2020). IEEE Xplore digital library, https://ieeexplore.ieee.org/Xplore/home.jsp.
[14] Ko, Youngjoong and Seo, Jungyun (2000). Automatic text categorization by unsupervised
learning. COLING 2000 Volume 1: The 18th International Conference on Computational
Linguistics.
[15] Li, Shaobo and Hu, Jie and Cui, Yuxin and Hu, Jianjun (2018). DeepPatent: patent
classification with convolutional neural networks and word embedding. Scientometrics, 117,
2, 721-744.
[16] Liang, Jiu-Zhen (2004). SVM multi-classifier and web document classification. Proceedings
of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.
04EX826), 3, 1347-1351.
[17] Manevitz, Larry M and Yousef, Malik (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 2, Dec, 139-154.
[18] McCallum, Andrew and Nigam, Kamal and others (1998). A comparison of event models
for naive bayes text classification. AAAI-98 workshop on learning for text categorization, 752,
1, 41-48.
[19] Moraes, Rodrigo and Valiati, João Francisco and Neto, Wilson P Gavião (2013). Document-level sentiment classification: An empirical comparison between SVM and ANN. Expert Systems with Applications, 40, 2, 621-633.
[20] Pedregosa, Fabian and Varoquaux, Gaël and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and others (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[21] Salton, Gerard and Buckley, Christopher (1988). Term-weighting approaches in automatic
text retrieval. Information processing & management, 24, 5, 513-523.
[22] Schölkopf, Bernhard and Platt, John C and Shawe-Taylor, John and Smola, Alex J and Williamson, Robert C (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 7, 1443-1471.
[23] Sievert, Carson and Shirley, Kenneth (2014). LDAvis: A method for visualizing and inter-
preting topics, Proceedings of the workshop on interactive language learning, visualization,
and interfaces, 63-70.
[24] Silva, Catarina and Ribeiro, Bernardete (2003). The importance of stop word removal on
recall values in text categorization. Proceedings of the International Joint Conference on
Neural Networks, 3, 1661-1666.
[25] Slonim, Noam and Friedman, Nir and Tishby, Naftali (2002). Unsupervised document
classification using sequential information maximization. Proceedings of the 25th annual
international ACM SIGIR conference on Research and development in information retrieval,
129-136.
[26] Suominen, Arho and Toivanen, Hannes and Seppänen, Marko (2017). Firms' knowledge profiles: Mapping patent data with unsupervised learning. Technological Forecasting and Social Change, 115, 131-142.
[27] Ting, SL and Ip, WH and Tsang, Albert HC (2011). Is Naive Bayes a good classifier for document classification? International Journal of Software Engineering and Its Applications, 5, 3, 37-46.
[28] Tran, Tung and Kavuluru, Ramakanth (2017). Supervised Approaches to Assign Cooperative Patent Classification (CPC) Codes to Patents. In: Ghosh, Ashish and Pal, Rajarshi and Prasath, Rajendra (eds.), Mining Intelligence and Knowledge Exploration. Springer International Publishing, 22-34. ISBN 978-3-319-71928-3.
[29] Tseng, Chun-Yao and Ting, Ping-Ho (2013). Patent analysis for technology development
of artificial intelligence: A country-level comparative study. Innovation, 15, 4, 463-475.
[30] Uysal, Alper Kursat and Gunal, Serkan (2014). The impact of preprocessing on text clas-
sification. Information Processing & Management, 50, 1, 104-112.
[31] Vapnik, Vladimir N (1995). The nature of statistical learning theory, Springer-Verlag.
[32] Venugopalan, Subhashini and Rai, Varun (2015). Topic based classification and pattern
identification in patents. Technological Forecasting and Social Change, 94, 236-250.
[33] Vijayarani, S and Ilamathi, Ms J and Nithya, Ms (2015). Preprocessing techniques for text
mining-an overview. International Journal of Computer Science & Communication Networks
5, 1, 7-16.
[34] Wang, Sida I and Manning, Christopher D (2012). Baselines and bigrams: Simple, good
sentiment and topic classification. Proceedings of the 50th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), 90-94.
[35] WIPO (2019). Technology Trends 2019: Artificial Intelligence. Geneva: World Intellectual Property Organization.
[36] Xu, Baoxun and Guo, Xiufeng and Ye, Yunming and Cheng, Jiefeng (2012). An Improved
Random Forest Classifier for Text Categorization. JCP, 7, 12, 2913-2920.