One-Class Support Vector Machine and LDA Topic Model Integration - Evidence for AI Patents


Anton Thielmann
Christoph Weisser
Astrid Krenz∗∗
August 2020
Abstract. The present contribution suggests a two-step classification rule for unsupervised
document classification, using one-class Support Vector Machines and Latent Dirichlet
Allocation Topic Modeling. The integration of both algorithms allows the usage of
labelled, but independent training data, not stemming from the data set to be classified.
The manual labelling when trying to classify a specific class from an unlabelled data set
can thus be circumvented. By choosing appropriate document representations and parameters
in the one-class Support Vector Machine, the differences between the independent
training class and the data set to be classified become negligible. The method is applied
to a large data set on patents for the European Union.
Keywords: Artificial Intelligence, Support Vector Machines, Topic Modeling.
JEL: C50.
University of Göttingen, Centre for Statistics, Humboldtallee 3, 37073 Göttingen, Germany; email:
University of Göttingen, Centre for Statistics, Humboldtallee 3, 37073 Göttingen, Germany; email:
∗∗ University of Sussex, Digital Futures at Work (Digit) Research Centre, Jubilee Building, Brighton, UK; email:
1. Introduction
A common problem in document classification is the vast amount of often unlabelled
data. Unsupervised document classification algorithms (see for example [2], [25], [14]) are often
only used in order to get a broad overview of the topics in these large data sets. A more thorough
classification of documents often involves manually creating a labelled training data set.
The present paper circumvents the manual labelling by making use of one-class Support Vector
Machines (SVM) [22] and Latent Dirichlet Allocation (LDA) topic modeling [2] using a large
data set on patent applications from the European Patent Office (EPO).
Patent classification is an important task in assessing technological progress as well as defining
innovative capability [6]. The international patent corpus is an enormous source which already
contains more than 100 million patents. The patent data comes with a set of industrial
classification codes, the so-called Cooperative Patent Classification (CPC). For the official patent
data, the classification is mostly done manually, and there exists by now a set of approximately
250,000 predefined classification entries [1]. A problem arises for new patent applications from
entirely new technology areas: there are simply no CPC codes available, and as such these
patents cannot easily be assigned to a classification code. This applies to patents in the area
of artificial intelligence (AI), for example. The classification problem gets more severe since the
number of patents in the field of AI is increasing [29].
The primary goal of this paper is to circumvent some of the problems that come along with
unsupervised document classification and to classify patents that cover the topic of AI. This
is mainly achieved by taking advantage of the similarity between scientific papers and patent
documents. The applied methods are ordered such that topic models contribute to the
classification problem beyond a merely descriptive function, resulting in a two-step
classification rule. The approach can easily be transferred to other
unsupervised document classification problems.
2. Related Literature
This paper especially draws from the literature on natural language processing (NLP) and
patent classification. The approaches to patent classification are manifold and range from simple
keyword search [35] and subsequent classification to the application of neural networks [15, 9].
However, most of these classification approaches either make use of an already labelled data
set [28, 15, 9] or have experts scan the data set and label it manually [32]. When it comes to
unsupervised patent classification, the literature is more focused on either keyword approaches
or topic modeling [26].
Other methods used in unsupervised document classification, such as k-means or sequential
clustering [4, 25], are mostly used in hierarchical classification algorithms [7, 12] where unsupervised
methods are combined with SVM or k-Nearest Neighbour (kNN) algorithms to obtain accurate
classification results.
The present classification thus applies two methods, the use of one-class SVM on a web-scraped
data set and LDA topic modeling.
To the best of our knowledge, one-class document classification has so far not been used in
patent classification. This could be due to the complexity of the data, the large number of
subclasses (250,000 CPC codes, and 70,000 International Patent Classification (IPC) codes) and
the possibility of multiple labelling. However, similar to the present approach, Fujino and Isozaki
[10] made use of the similarity between scientific papers and patent abstracts, labelling scientific
papers with IPC codes.
The successful application of one-class SVM in document classification has already been shown
by Manevitz and Yousef [17] and is used in a very similar way in the present approach.
The present approach makes use of these two successful methods and combines them with
LDA topic modeling [2].
3. Data
The data set at hand is the so-called global PATSTAT data set from the EPO, which contains
bibliographical and legal event patent data from leading industrialised and developing countries.
It comprises more than 100 million patents, which are extracted from the EPO's various
databases. In order to get a more suitably sized data set, the complete data set was filtered in
the following way: Only patents of type 'A' (the pure patents) from 1980 to today are included.
All patents for which no address of any patent holder is known, which do not originate from
an EU28 country, or which are not written in English are excluded. This guarantees that we
are working with data for the European Union.
From the roughly 540,000 remaining patents, all abstracts are taken into account, preprocessed
and analyzed. The preprocessing of the text data follows the common text preprocessing in NLP
[30, 33]. All words are put in lowercase letters and tokenized. All numbers and symbols are
removed and stopword removal [24] is applied using spaCy's built-in dictionary [11] extended by
patent-specific words such as "method" or "patent". The remaining words are lemmatized, resulting
in patent abstracts of 54 words on average.
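The preprocessing steps described above can be illustrated with a minimal sketch. It is not the authors' code: the small stop-word set is a hypothetical stand-in for spaCy's built-in list, the function name `preprocess` is invented for illustration, and lemmatization is omitted because it would require a language model.

```python
import re

# Hypothetical stand-in for spaCy's English stop-word list (assumption: tiny subset).
BASE_STOPWORDS = {"a", "an", "the", "of", "for", "and", "is", "are", "to", "in", "with", "on"}
# Patent-specific additions named in the text.
PATENT_STOPWORDS = {"method", "patent"}
STOPWORDS = BASE_STOPWORDS | PATENT_STOPWORDS


def preprocess(abstract):
    """Lowercase, strip numbers and symbols, tokenize, and drop stop words."""
    lowered = abstract.lower()
    # Replace every run of characters other than a-z with a single space.
    # Note: this also drops non-ASCII letters, which is a simplification.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    tokens = letters_only.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = preprocess("A method for training 3 neural networks, described in the patent.")
print(tokens)  # ['training', 'neural', 'networks', 'described']
```

In a full pipeline, a lemmatization step would follow the stop-word filter, so that, for example, "networks" and "network" map to the same term.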
4. Method
In this section, the one-class SVM for document classification using term frequency inverse
document frequency (tf-idf) vectors as input and the LDA topic model are described. We
also describe how the training data are generated from abstracts of scientific papers. The general
idea is to create a classification rule that classifies a majority of all positive documents as
positive, while false positive classifications are negligible. Subsequently, the remaining false
positives are identified using LDA topic models. Before going into detail for each
step, the algorithm as seen in Figure 1 is summarized: First, the training data are generated by
simple web scraping. The similarity between scientific paper abstracts and patent abstracts is
used, and scientific papers covering the relevant topic are scraped. Second, a one-class SVM is
trained on the single-class training data, which consists of the scraped scientific papers. Third,
the model predictions on the original data set are obtained. Subsequently, the so-classified
patents are analyzed with the help of LDA topic models. The relevant topics are identified with
the help of visual representations [23, 5]. Depending on the perceived quality of the one-class
SVM classification and the LDA topic models, the patents related to the relevant topic will be
4.1. One-Class Support Vector Machines. The obvious advantage of one-class classification
algorithms is the fact that only one class needs to be synthesized. Schölkopf et al. [22] introduced
such an algorithm which extends the classical SVM [31, 3] algorithm and only incorporates a
single class in training. Considering $x_1, \dots, x_\ell \in X$ as training data belonging to a single class and
$\ell \in \mathbb{N}$ being the number of observations, the classical SVM optimization problem is extended
only slightly, to obtain a decision function $f(x)$ which is positive on $S$ (with $S$ as a subset of the
feature space $H$) and negative on the complement $\bar{S}$, so that
$$f(x) = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \in \bar{S} \end{cases} \qquad (1)$$
Figure 1. Prediction Procedure
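Non-formally, the probability that a test point from the data set's probability distribution lies
outside of $S$ is bounded by some a priori specified value $\nu \in (0,1]$ [17, 22]. More formally, the
optimization problem becomes:
$$\min_{w \in H,\ \xi \in \mathbb{R}^{\ell},\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu \ell} \sum_{i=1}^{\ell} \xi_i - \rho \qquad (2)$$
subject to: $(w \cdot \Phi(x_i)) \geq \rho - \xi_i, \quad i = 1, 2, \dots, \ell, \quad \xi_i \geq 0$
where $\Phi$ is a kernel map $X \to H$, such that $\Phi$ can be computed by evaluating a simple kernel
function, $w$ is the normal vector to the separating hyperplane, $\rho$ is an offset parameterizing the hyperplane
in the feature space and $\xi_i$ are nonnegative slack variables.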
The tradeoff between the decision function
$$f(x) = \operatorname{sgn}((w \cdot \Phi(x)) - \rho) \qquad (3)$$
being positive for most of the training data $x_1, \dots, x_\ell \in X$ and $\|w\|$ being small is controlled by
$\nu$, which serves as a bound in two respects. It is firstly an upper bound on the fraction of outliers [22]
and secondly a lower bound on the fraction of support vectors in relation to the total number
of training data. In the present case, as we are using training data not originating from the
original data set, finding the optimal $\nu$ is of crucial importance. The idea is to set $\nu$ very low
and thus "create" a larger subset $S$ of the feature space $H$ in order to avoid overfitting problems
and to obtain a classification rule that is applicable to a more diverse data set, thus integrating
a training class that does not stem from the data set to be classified.
Using the one-class SVM for document classification, the documents need to be represented
in a more suitable way. Similar to Manevitz and Yousef [17] we used more than one document
representation, namely a binary representation and a tf-idf representation [21], with
$$\text{tf-idf}(\text{word}) = \text{frequency}(\text{word}) \cdot \left[\log \frac{k}{K(\text{word})} + 1\right] \qquad (4)$$
where $k$ is the total number of words in the dictionary and $K(\text{word})$ gives the total number
of documents the word appears in.
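The first classification step can be sketched with scikit-learn as follows. This is an illustrative sketch rather than the authors' implementation: the toy abstracts are hypothetical, fitting the vocabulary on the training class only is an assumption, and `gamma="scale"` is an assumed kernel setting. The rbf kernel, the L2 normalization and the value 0.005 for ν are the settings reported in Section 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical toy stand-ins for the scraped paper abstracts (training class).
train_abstracts = [
    "neural network learning algorithm for image recognition",
    "artificial intelligence agent planning with search algorithm",
    "machine learning model for natural language processing",
    "deep neural network training with gradient descent",
]
# Hypothetical toy stand-ins for the patent abstracts to be classified.
patent_abstracts = [
    "neural network apparatus for speech recognition",
    "bicycle frame with folding hinge and locking pin",
    "learning algorithm for language processing device",
]

# smooth_idf=False yields idf = ln(n / df) + 1, i.e. the same functional form as
# the bracket in Eq. (4), with the number of documents n in the numerator.
vectorizer = TfidfVectorizer(norm="l2", smooth_idf=False)
X_train = vectorizer.fit_transform(train_abstracts)  # vocabulary from training class only (assumption)
X_patents = vectorizer.transform(patent_abstracts)

# A very low nu tolerates almost no outliers in the training class.
clf = OneClassSVM(kernel="rbf", nu=0.005, gamma="scale")
clf.fit(X_train)

# +1: inside the learned region S (candidate AI patent), -1: outside of S.
predictions = clf.predict(X_patents)
print(predictions)
```

Only the documents predicted as +1 would be passed on to the topic-modeling step; on a toy corpus of this size the individual predictions are not meaningful.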
4.2. Topic Modeling. To ensure the integration of the unrelated training data and thus justify
the first step of the classification, we make use of LDA topic models [2]. LDA topic modeling is
an unsupervised machine learning technique that detects word and phrase patterns, defined as
topics, in a set of documents. The general idea is that topics are characterized by a distribution
over words, no matter the positional occurrence of words. Thus, as defined by Blei et al. [2],
documents are represented by a random mixture over these latent topics. Basically, we look
for the joint posterior probability of a distribution of topics for each document, N topics for
each document and a distribution of words for each topic, given the corpus of all documents.
Formally it can be stated as:
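$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) \qquad (5)$$
with $\mathbf{w}$ being one of $M$ documents, denoted as a sequence of words $\mathbf{w} = (w_1, \dots, w_N)$, from the corpus
$D = \{\mathbf{w}_1, \dots, \mathbf{w}_M\}$, and $\theta$ representing the probability of the i-th document to contain the j-th topic.
Thus, the joint probability is equal to:
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (6)$$
with $z_n$, $n = 1, \dots, N$ being $N$ topics for each document and $\alpha$ and $\beta$ representing the distribution-related
parameters of a Dirichlet distribution.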
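To make the roles of the topic mixture and the topic-word distributions more concrete, the following sketch fits a unigram LDA model with scikit-learn. It is an assumption that scikit-learn is a suitable stand-in here: the text does not state which LDA implementation was used, scikit-learn estimates the model by variational inference, and the toy corpus and the choice of two topics (instead of the 10 used in the paper) are purely illustrative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the documents classified as positive.
docs = [
    "neural network layer training neural network",
    "language text word sentence language process",
    "network neuron layer weight training",
    "speech language translation text process",
]

# Unigram bag-of-words counts; word order is ignored, as in the LDA model.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # one row per document: estimated topic proportions

# Recover the vocabulary in column order from the fitted vectorizer.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Print the three highest-weighted words per topic.
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(topic_idx, top)
```

Each row of `doc_topic` sums to one and plays the role of θ for that document, while each row of `lda.components_` holds the unnormalized word weights of one topic.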
In order to get a general understanding of the data and the topics that the data set covers,
topic models could have already been applied to the original data set right from the beginning.
However, by applying the topic models only to those documents classified as positive by the
classifier, which is what our approach in this paper does, the resulting topics are much easier to
4.3. Training class. Since the patent data is not labelled in such a way that it is possible
to create a training class that adequately represents AI, a different, more complex approach is
required. To generate a valid data set as training data without requiring experts to scan and
read a large part of the 540,000 patent abstracts, the similarity between scientific papers and
patents is used, in a similar manner as done by Fujino and Isozaki [10]. Both document types
have in common that a short summary of the full text, the abstract, is given at the beginning
of the document. Scientific papers, however, often come with so-called "keywords", which
implicitly assign a topic to the paper. As such, labelled training data was scraped from
IEEE Xplore [13]. 1008 papers which had the keyword "artificial intelligence" were scraped
as the training class. One should keep in mind that other relevant keywords, such as "machine
learning", could also be web-scraped, and that the training data could be scraped from different
websites; both are options for future analyses. For our paper,
however, we focused on the scraping of the main AI keyword.
In order to confirm our expectation that the web-scraping of scientific AI papers was successful,
LDA topic models were fitted to the corpus of the scientific AI papers. We kept the models
simple and fixed the number of topics to 10, not using bigram or trigram models [34], but the
simpler unigram LDA model [2]. To evaluate and interpret the model, we looked at LDAvis
plots [23]. The results support this expectation, with some of the most salient terms being "artificial",
"network", "intelligence" or "algorithm". This can be further confirmed when looking at a
word cloud corresponding to one randomly selected topic generated by the LDA topic model (see
Figure 2).
Figure 2. Wordcloud LDA topic model - web-scraping for scientific AI papers
5. Results
Similar to Manevitz and Yousef [17], the one-class SVM yielded the best results when using
the radial basis function (rbf) as kernel function. However, unlike Manevitz and Yousef [17],
the best results in terms of document representation are achieved with the tf-idf representation,
using scikit-learn's [20] built-in vectorizer and applying the L2 normalization, which accounts for different
document lengths.
The one-class SVM, using the rbf as kernel function and tf-idf document representations,
classified 1272 patents as "AI patents". Relative to the roughly 540,000 original patents,
this amounts to about 0.24 percent of all patents. We set $\nu$ very
low at 0.005 as we found that the web-scraping resulted in a very accurate training class which
does not include a lot of outliers. Setting $\nu$ at 0.5, for example, yielded, as expected, significantly
worse classification results. Furthermore, the main objective in the first classification level is to
avoid relevant patents being classified as non-AI. The results of the topic modeling, again using
simple unigram models with a fixed number of 10 topics, reveal that the patents share a large
resemblance with the scraped scientific AI papers, but do not include words such as "artificial"
or "intelligence" (see Figure 3).
Figure 3. Wordclouds for topics for the classified AI patents
Using the LDAvis representation [23] to conduct further analyses on AI sub-categories and
heterogeneity, the topic models revealed two topics that clearly describe sub-fields of AI, namely
NLP and neural networks. They include the prominent terms "neural" and "network" as well
as "language" and "process" (see Figure 4).
Figure 4. Wordclouds - LDAvis representation
The final classification rule for the sub-set and heterogeneity analyses depends on the identified
topics and their underlying distribution, as well as on the researcher and the research question.
Choosing a very conservative approach and only selecting the patents
where the most dominant topic is one of the two defined sub-fields of AI, we were able to identify
126 patents that make use of neural networks and 80 patents that make use of NLP. If, for
example, we select those patents that have a prevalence of at least 15% for the "neural network"
topic, we find 207 patents covering the topic of neural networks.
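The two selection rules mentioned above, the conservative dominant-topic rule and the 15% prevalence threshold, can be written down in a few lines. The document-topic matrix and the topic index below are hypothetical; in practice the matrix would come from the fitted LDA model and the index would be read off the topic visualization.

```python
import numpy as np

# Hypothetical document-topic matrix: one row per classified patent, one column per topic.
doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.16, 0.54, 0.30],
    [0.10, 0.10, 0.80],
    [0.40, 0.45, 0.15],
])
NEURAL_NET_TOPIC = 0  # hypothetical index of the "neural network" topic

# Conservative rule: keep patents whose most dominant topic is the target topic.
dominant = doc_topic.argmax(axis=1) == NEURAL_NET_TOPIC

# Looser rule: keep patents with a prevalence of at least 15% for the target topic.
prevalent = doc_topic[:, NEURAL_NET_TOPIC] >= 0.15

print(np.flatnonzero(dominant))   # [0]
print(np.flatnonzero(prevalent))  # [0 1 3]
```

In this toy matrix the threshold rule selects more documents than the dominant-topic rule, which mirrors the direction of the counts reported in the text (207 versus 126 patents).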
6. Conclusion
The implementation of a one-class SVM and LDA topic modeling on an independent data set
on European patents yields very good classification results. With this methodological approach, a
central problem of unsupervised document classification, namely the identification and classification
of sparsely represented classes, can be circumvented without the need for manual labelling. The presented
method is especially useful when a reliable extraction of a negative training class is not possible.
A possible extension is to use the classification results from the LDA topic model to
train a different, more powerful classifier with the original data, such as Naive Bayes [27, 18], Random
Forest [36], Artificial Neural Networks [19] or classical SVMs [8, 16], including a negative class
during training. The idea of the current approach would then shift towards generating a suitable
training data set and subsequently using classification algorithms to classify the documents.
References
[1] EPO (2020),
[2] Blei, David M and Ng, Andrew Y and Jordan, Michael I (2003). Latent dirichlet allocation.
Journal of machine Learning research, 3, Jan, 993-1022.
[3] Boser, Bernhard E and Guyon, Isabelle M and Vapnik, Vladimir N (1992). A training
algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on
Computational learning theory, 144-152.
[4] Buchta, Christian and Kober, Martin and Feinerer, Ingo and Hornik, Kurt (2012). Spherical
k-means clustering. Journal of Statistical Software, 50, 10, 1-22.
[5] Chaney, Allison June-Barlow and Blei, David M (2012). Visualizing topic models. Sixth
international AAAI conference on weblogs and social media.
[6] Chen, Yunwei and Yang, Zhiping and Shu, Fang and Hu, Zhengyin and Meyer, Martin and
Bhattacharya, Sujit (2009). A patent based evaluation of technological innovation capability
in eight economic regions in PR China, World Patent Information, 31, 2, 104-110.
[7] Chen, Yen-Liang and Chang, Yuan-Che (2012). A three-phase method for patent
classification. Information Processing & Management, 48, 6, 1017-1030.
[8] Colas, Fabrice and Brazdil, Pavel (2006). Comparison of SVM and some older classification
algorithms in text classification tasks, IFIP International Conference on Artificial Intelligence
in Theory and Practice, Springer Verlag, 169-178.
[9] Fall, Caspar J and Törcsvári, Atilla and Benzineb, Karim and Karetka, Gabor (2003).
Automated categorization in the international patent classification. ACM New York, NY, USA,
ACM SIGIR Forum, 37, 1, 10-25.
[10] Fujino, Akinori and Isozaki, Hideki (2008). Multi-label Classification using Logistic
Regression Models for NTCIR-7 Patent Mining Task. NTCIR.
[11] Honnibal, Matthew and Montani, Ines (2017). spacy 2: Natural language understanding
with bloom embeddings, convolutional neural networks and incremental parsing, To appear,
7, 1.
[12] Hu, Jie and Li, Shaobo and Yao, Yong and Yu, Liya and Yang, Guanci and Hu, Jianjun
(2018). Patent keyword extraction algorithm based on distributed representation for patent
classification. Entropy, 20, 2, 104.
[13] IEEE Xplore (2020). IEEE Xplore digital library, url =
[14] Ko, Youngjoong and Seo, Jungyun (2000). Automatic text categorization by unsupervised
learning. COLING 2000 Volume 1: The 18th International Conference on Computational
[15] Li, Shaobo and Hu, Jie and Cui, Yuxin and Hu, Jianjun (2018). DeepPatent: patent
classification with convolutional neural networks and word embedding. Scientometrics, 117,
2, 721-744.
[16] Liang, Jiu-Zhen (2004). SVM multi-classifier and web document classification. Proceedings
of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.
04EX826), 3, 1347-1351.
[17] Manevitz, Larry M and Yousef, Malik (2001). One-class SVMs for document classification.
Journal of machine Learning research, 2, Dec. 139-154.
[18] McCallum, Andrew and Nigam, Kamal and others (1998). A comparison of event models
for naive bayes text classification. AAAI-98 workshop on learning for text categorization, 752,
1, 41-48.
[19] Moraes, Rodrigo and Valiati, João Francisco and Neto, Wilson P Gavião (2013). Document-level
sentiment classification: An empirical comparison between SVM and ANN. Expert
Systems with Applications, 40, 2, 621-633.
[20] Pedregosa, Fabian and Varoquaux, Gaël and Gramfort, Alexandre and Michel, Vincent
and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter
and Weiss, Ron and Dubourg, Vincent and others (2011). Scikit-learn: Machine learning in
Python. The Journal of machine Learning research, 12, 2825-2830.
[21] Salton, Gerard and Buckley, Christopher (1988). Term-weighting approaches in automatic
text retrieval. Information processing & management, 24, 5, 513-523.
[22] Schölkopf, Bernhard and Platt, John C and Shawe-Taylor, John and Smola, Alex J and
Williamson, Robert C (2001). Estimating the support of a high-dimensional distribution.
Neural computation, 13, 7, 1443-1471.
[23] Sievert, Carson and Shirley, Kenneth (2014). LDAvis: A method for visualizing and
interpreting topics, Proceedings of the workshop on interactive language learning, visualization,
and interfaces, 63-70.
[24] Silva, Catarina and Ribeiro, Bernardete (2003). The importance of stop word removal on
recall values in text categorization. Proceedings of the International Joint Conference on
Neural Networks, 3, 1661-1666.
[25] Slonim, Noam and Friedman, Nir and Tishby, Naftali (2002). Unsupervised document
classification using sequential information maximization. Proceedings of the 25th annual
international ACM SIGIR conference on Research and development in information retrieval,
[26] Suominen, Arho and Toivanen, Hannes and Seppänen, Marko (2017). Firms' knowledge
profiles: Mapping patent data with unsupervised learning. Technological Forecasting and
Social Change, 115, 131-142.
[27] Ting, SL and Ip, WH and Tsang, Albert HC (2011). Is Naive Bayes a good classifier for
document classification. International Journal of Software Engineering and Its Applications,
5, 3, 37-46.
[28] Tran, Tung and Kavuluru, Ramakanth (2017). Supervised Approaches to Assign
Cooperative Patent Classification (CPC) Codes to Patents. In: Ghosh, Ashish and Pal, Rajarshi
and Prasath, Rajendra (eds.), Mining Intelligence and Knowledge Exploration. Springer
International Publishing, 22-34. ISBN 978-3-319-71928-3.
[29] Tseng, Chun-Yao and Ting, Ping-Ho (2013). Patent analysis for technology development
of artificial intelligence: A country-level comparative study. Innovation, 15, 4, 463-475.
[30] Uysal, Alper Kursat and Gunal, Serkan (2014). The impact of preprocessing on text
classification. Information Processing & Management, 50, 1, 104-112.
[31] Vapnik, Vladimir N (1995). The nature of statistical learning theory, Springer-Verlag.
[32] Venugopalan, Subhashini and Rai, Varun (2015). Topic based classification and pattern
identification in patents. Technological Forecasting and Social Change, 94, 236-250.
[33] Vijayarani, S and Ilamathi, Ms J and Nithya, Ms (2015). Preprocessing techniques for text
mining-an overview. International Journal of Computer Science & Communication Networks
5, 1, 7-16.
[34] Wang, Sida I and Manning, Christopher D (2012). Baselines and bigrams: Simple, good
sentiment and topic classification. Proceedings of the 50th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), 90-94.
[35] WIPO (2019). Technology Trends 2019: Artificial Intelligence. Geneva: World
Intellectual Property Organization.
[36] Xu, Baoxun and Guo, Xiufeng and Ye, Yunming and Cheng, Jiefeng (2012). An Improved
Random Forest Classifier for Text Categorization. JCP, 7, 12, 2913-2920.
... Measurement of Artificial Intelligence at the micro-level is done in the past research literature either at the level of the firm (Eurostat, 2022, Hunt et al., 2022, using job vacancy or occupational data or by generating an occupation-based AI index (Acemoglu and Restrepo 2020;Felten 7 et al., 2018Felten 7 et al., , 2019, or based on patents, considering the inventions/ innovations that relate to the elements of AI (Thielmann et al., 2021, Martinelli et al., 2021. In terms of firm-level evidence, Hunt et al. (2022) conducted a survey on more than 750 UK business leaders. ...
... The OECD (2022b) has recently set up a framework for the classification of AI systems, acknowledging that AI is diffusing through all sectors. Thielmann et al. (2021) classify AI patents based on a patent data set from the European Patent Office and by developing a new two-step method that uses web-scraping for AI-related keywords as well as a support vector machine and LDA topic modelling. Martinelli et al. (2021) also classified AI patents, they use classification rules based on keywords and industry (CPC) codes. ...
Full-text available
International comparisons of the development of AI have increasingly been made with the use of composite indexes. These aim to identify countries that are at the forefront of AI and those lagging behind. In this paper, we focus in particular on the talent component related to AI. We analyse a new dataset based on the Global Artificial Intelligence Index and compare new measures of tech talent workers with traditional education-based measures. We provide an overview of different measures of AI, explain how tech talent working in AI, Machine Learning, and Data Science is measured, and which data are utilised. We show that several traditional measures are well approximated by the new measures. Countries positioning as forerunners or laggards in terms of new tech talent workers are detected. Our analysis establishes a link between governments' R&D and AI-related expenditures or targets and the number of researchers and new tech talent workers in a country, as well as between talent, the digital infrastructure, and innovative activities.
... Additionally, we introduce a novel evaluation metric, based upon the centroid cluster of stopwords in the embedding space. Given the approach of enhancing the reference corpus, the described model might be especially useful when evaluating short texts or identifying sparsely represented topics in a corpus [48,49]. Through the inherent sparsity of the data, the words best describing a topic might not be included in the reference corpus and an enhancement could thus greatly improve the creation of topics. ...
Extracting and identifying latent topics in large text corpora has gained increasing importance in Natural Language Processing (NLP). Most models, whether probabilistic models similar to Latent Dirichlet Allocation (LDA) or neural topic models, follow the same underlying approach of topic interpretability and topic extraction. We propose a method that incorporates a deeper understanding of both sentence and document themes, and goes beyond simply analyzing word frequencies in the data. This allows our model to detect latent topics that may include uncommon words or neologisms, as well as words not present in the documents themselves. Additionally, we propose several new evaluation metrics based on intruder words and similarity measures in the semantic space. We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task. We demonstrate the competitive performance of our method with a large benchmark study, and achieve superior results compared to state-of-the-art topic modeling and document clustering models.
... Supervised one-class document classification, however, is remarkably accurate [14,15]. The adaptation of supervised one-class document classification approaches to the unsupervised case, with the integration of out-of-domain training data has shown promise regarding the classification of patents in the realm of artificial intelligence [26]. The integration of out-of-domain training data into document classification offers a great opportunity for document classification due to the large availability of text data on the internet. ...
Full-text available
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or it fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
Full-text available
Data mining is used for finding the useful information from the large amount of data. Data mining techniques are used to implement and solve different types of research problems. The research related. It is also called knowledge discovery in text (KDT) or knowledge of intelligent text analysis. Text mining is a technique which extracts information from both structured and unstructured data and also finding patterns. Text mining techniques are used in various types of research domains like natural language processing, information retrieval, text classification and text clustering.
Full-text available
Many text mining tasks such as text retrieval, text summarization, and text comparisons depend on the extraction of representative keywords from the main text. Most existing keyword extraction algorithms are based on discrete bag-of-words type of word representation of the text. In this paper, we propose a patent keyword extraction algorithm (PKEA) based on the distributed Skip-gram model for patent classification. We also develop a set of quantitative performance measures for keyword extraction evaluation based on information gain and cross-validation, based on Support Vector Machine (SVM) classification, which are valuable when human-annotated keywords are not available. We used a standard benchmark dataset and a homemade patent dataset to evaluate the performance of PKEA. Our patent dataset includes 2500 patents from five distinct technological fields related to autonomous cars (GPS systems, lidar systems, object recognition systems, radar systems, and vehicle control systems). We compared our method with Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), TextRank and Rapid Automatic Keyword Extraction (RAKE). The experimental results show that our proposed algorithm provides a promising way to extract keywords from patent texts for patent classification.
Patent data has been an obvious choice for analysis leading to strategic technology intelligence, yet the recent proliferation of machine learning text analysis methods is changing the status of traditional patent data analysis methods and approaches. This article discusses the benefits and constraints of machine learning approaches in industry level patent analysis, and to this end offers a demonstration of unsupervised learning based analysis of the leading telecommunication firms between 2001 and 2014 based on about 160,000 USPTO full-text patents. Data were classified using full-text descriptions with Latent Dirichlet Allocation, and latent patterns emerging through the unsupervised learning process were modelled by company and year to create an overall view of patenting within the industry, and to forecast future trends. Our results demonstrate company-specific differences in their knowledge profiles, as well as show the evolution of the knowledge profiles of industry leaders from hardware to software focussed technology strategies. The results also cast light on the dynamics of emerging and declining knowledge areas in the telecommunication industry. Our results prompt a consideration of the current status of established approaches to patent landscaping, such as keyword or technology classifications and other approaches relying on semantic labelling, in the context of novel machine learning approaches. Finally, we discuss implications for policy makers, and, in particular, for strategic management in firms.
Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment.
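The prototype-based partitioning with cosine dissimilarities described in this abstract can be illustrated with a minimal Python sketch. It is not the skmeans R package and not any of its solvers; it is a plain fixed-point iteration written under stated assumptions. The function names (`normalize`, `cosine`, `spherical_kmeans`), the toy term-weight vectors, and the parameters `n_iter` and `seed` are all hypothetical and chosen only for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors, i.e. their dot product."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, k, n_iter=20, seed=0):
    """Illustrative fixed-point iteration for spherical k-means.

    Each document vector is projected onto the unit sphere, assigned to the
    prototype with the highest cosine similarity, and every prototype is then
    re-estimated as the normalized sum of its members.
    """
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    # Assumption: prototypes are initialized from k randomly chosen documents.
    prototypes = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: most similar prototype (ties go to the lowest index).
        new_labels = [
            max(range(k), key=lambda j: cosine(x, prototypes[j])) for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: normalized sum of the members of each cluster.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                prototypes[j] = normalize([sum(col) for col in zip(*members)])
    return labels, prototypes


# Toy term-weight vectors: two documents use terms 0-1, two use terms 2-3.
docs = [[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 4], [0, 0, 2, 2]]
labels, prototypes = spherical_kmeans(docs, k=2)
print(labels)
```

Normalizing both the documents and the prototypes means that maximizing the dot product is equivalent to minimizing the cosine dissimilarity, which is why the assignment step needs no explicit distance computation.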
Patent classification is an essential task in patent information management and patent knowledge mining. However, this task is still largely done manually due to the unsatisfactory performance of current algorithms. Recently, deep learning methods such as convolutional neural networks (CNN) have led to great progress in image processing, voice recognition, and speech recognition, but they have yet to be applied to patent classification. We proposed DeepPatent, a deep learning algorithm for patent classification based on CNN and word vector embedding. We evaluated the algorithm on the standard patent classification benchmark dataset CLEF-IP and compared it with other algorithms in the CLEF-IP competition. Experiments showed that DeepPatent with automatic feature extraction achieved a classification precision of 83.98%, which outperformed all the existing algorithms that used the same information for training. Its performance is better than the state-of-the-art patent classifier with a precision of 83.50%, whose performance is, however, based on 4000 characters from the description section and a lot of feature engineering while DeepPatent only used the title and abstract information. DeepPatent is further tested on USPTO-2M, a patent classification benchmark data set that we contributed with 2,000,147 records after data cleaning of 2,679,443 USA raw utility patent documents in 637 categories at the subclass level. Our algorithms achieved a precision of 73.88%.
In the history of research of the learning problem one can extract four periods that can be characterized by four bright events: (i) Constructing the first learning machines, (ii) constructing the fundamentals of the theory, (iii) constructing neural networks, (iv) constructing the alternatives to neural networks.
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is designed for analyzing very high dimensional data with multiple classes, of which text corpora are a well-known example. A novel feature weighting method and a tree selection method are developed and work together to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce subspace size and improve classification performance without increasing the error bound. We apply the proposed method to six text data sets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.
Artificial intelligence (AI) plays a key role in knowledge economies, because it can be used to develop systems that think like humans, act like humans, think rationally, and act rationally (Russell & Norvig, 2010). In this study, we divide AI into four sub-technological fields: problem reasoning and solving, machine learning, network structures, and knowledge processing systems. This study investigates three main issues related to the technology development of AI. First, the aggregate technology development of AI is examined, and the four sub-technological fields of AI are compared. Second, we employ measures of patent quantity and patent quality to demonstrate the technology development of AI in different countries. Finally, we investigate the technology positions of different countries in the four sub-technological fields of AI. By analyzing a patent and citation dataset comprising all patents granted by the United States Patent and Trademark Office from 1976 to 2010, we obtain empirical findings that help us understand the technology development of AI in different countries. The major contributions of this study are four measures of patent quantity (PCA, PCI, SHAI, and SHIA) and three measures of patent quality (citation ratios, CII, and TCT). These measures are helpful in understanding the technological development of AI in different countries. Moreover, we use patent citation data to investigate the technology flow in AI, in order to determine the technology position of different countries in the four sub-technological fields of AI.