Chapter

Extrinsic Evaluation of Cross-Lingual Embeddings on the Patent Classification Task


Abstract

In this article we compare the quality of various cross-lingual embeddings on a cross-lingual text classification problem and explore the possibility of transferring knowledge between languages. We consider Multilingual Unsupervised and Supervised Embeddings (MUSE), multilingual BERT embeddings, XLM-RoBERTa (XLM-R) model embeddings, and Language-Agnostic Sentence Representations (LASER). Various classification algorithms use them as inputs for solving the task of patent categorization. It is a zero-shot cross-lingual classification task, since the training and validation sets consist of English texts, while the test set consists of documents in Russian.
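The zero-shot setup described in the abstract can be sketched in a few lines: train a classifier on English embeddings, apply it unchanged to Russian embeddings that live in the same cross-lingual space. The sketch below is a minimal illustration in NumPy with synthetic vectors standing in for real MUSE/LASER/mBERT/XLM-R embeddings; the encoder, data, and labels are all stand-ins, and a nearest-centroid classifier substitutes for the paper's various algorithms.

```python
import numpy as np

def fit_centroids(X, y):
    """Train a nearest-centroid classifier: one mean vector per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    """Assign each embedding to the class with the closest centroid."""
    labels = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return np.array(labels)[dists.argmin(axis=0)]

rng = np.random.default_rng(0)
n_classes, dim = 3, 16
class_means = rng.normal(size=(n_classes, dim))

# "English" training embeddings and "Russian" test embeddings: in a truly
# cross-lingual space, documents of the same class cluster together
# regardless of language, which is what makes zero-shot transfer possible.
y_train = rng.integers(0, n_classes, size=300)
X_train = class_means[y_train] + 0.1 * rng.normal(size=(300, dim))
y_test = rng.integers(0, n_classes, size=100)
X_test = class_means[y_test] + 0.1 * rng.normal(size=(100, dim))

centroids = fit_centroids(X_train, y_train)
accuracy = (predict(centroids, X_test) == y_test).mean()
```

The classifier never sees a Russian (here: "test-distribution") label; transfer quality depends entirely on how well the embedding model aligns the two languages.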

... This integration made it possible to fully exploit the invariance of the HSN across semantically close fragments of text in different languages when solving applied problems of information retrieval and analysis of large document collections, and to lay a foundation for systems that process multimodal information using relational-situational analysis [36]. As a result, HSNs are actively used for cross-language detection of text borrowings [37], analysis of collections of regulatory documents [38], content filtering [39], and patent classification [40]. ...
Article
Patent classification is an important part of the patent examination and management process. Efficient and accurate automatic patent classification can significantly improve patent retrieval performance. However, current monolingual patent classification models are insufficient for cross-lingual patent tasks, so research into cross-lingual patent categorization is crucial. In this paper, we propose a cross-lingual patent classification model based on the pre-trained model named XLM-R–CNN. In addition, we constructed a large patent dataset called XLPatent that includes Chinese, English, and German documents. We conducted experiments to evaluate model performance with several metrics. The experimental results showed that XLM-R–CNN achieved a classification accuracy of 73% and an average precision of 94%.
Article
Full-text available
We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER .
Article
Full-text available
State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent works showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally show that our method is a first step towards fully unsupervised machine translation and describe experiments on the English-Esperanto language pair, on which there only exists a limited amount of parallel data.
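Once two monolingual embedding spaces have been aligned as described above, building a bilingual dictionary reduces to nearest-neighbor search: each source word is translated to the target word whose vector is closest under cosine similarity. A minimal sketch with toy two-dimensional vectors (the vocabulary and vectors here are purely illustrative, not real MUSE output; MUSE itself additionally uses the CSLS criterion rather than plain nearest neighbors):

```python
import numpy as np

def translate(word, src_emb, tgt_emb):
    """Translate by cosine nearest neighbor in the shared (aligned) space."""
    v = src_emb[word]
    v = v / np.linalg.norm(v)
    best, best_sim = None, -1.0
    for tgt_word, u in tgt_emb.items():
        sim = float(v @ (u / np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = tgt_word, sim
    return best

# Toy aligned spaces: after alignment, translation pairs point in
# (approximately) the same direction.
src = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.1, 1.0])}
tgt = {"кот": np.array([0.9, 0.2]), "собака": np.array([0.2, 0.9])}
```

For example, `translate("cat", src, tgt)` returns `"кот"` because the two vectors are nearly parallel after alignment.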
Article
Full-text available
The World Intellectual Property Organization is currently developing a system for assisting users in categorizing patent documents in the International Patent Classification (IPC). The system should support the classification of documents in several languages and aims to assist users in locating relevant IPC symbols by providing them with a convenient web-based service. The approach taken for developing such a system relies on powerful machine learning algorithms that are trained on manually classified documents to recognize IPC topics. We detail in-house results of applying a custom-built state-of-the-art computer-assisted categorizer to English, French, Russian, and German-language patent documents. We find that reliable computer-assisted categorization at IPC subclass level is an achievable goal for the statistical methods employed here. A categorization system suggesting three IPC symbols for each document can predict the main IPC class correctly for around 90% of documents, and the main IPC subclass for about 85% of documents. The accuracy of the system at main group level is enhanced if the user first validates the correct IPC class.
Article
Full-text available
A new reference collection of patent documents for training and testing automated categorization systems is established and described in detail. This collection is tailored for automating the attribution of international patent classification codes to patent applications and is made publicly available for future research work. We report the results of applying a variety of machine learning algorithms to the automated categorization of English-language patent documents. This procedure involves a complex hierarchical taxonomy, within which we classify documents into 114 classes and 451 subclasses. Several measures of categorization success are described and evaluated. We investigate how best to resolve the training problems related to the attribution of multiple classification codes to each patent document.
Article
We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.
Article
Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet requiring tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores up to 32.8, without using even a single parallel sentence at training time.
Article
Cross-lingual embedding models allow us to project words from different languages into a shared embedding space. This allows us to apply models trained on languages with a lot of data, e.g. English, to low-resource languages. In the following, we will survey models that seek to learn cross-lingual embeddings. We will discuss them based on the type of approach and the nature of the parallel data that they employ. Finally, we will present challenges and summarize how to evaluate cross-lingual embedding models.
Chapter
Most research on IPC automatic classification systems has focused on applying various existing machine learning methods to patent documents rather than considering the characteristics of the data or the structure of the patent documents. This paper, therefore, proposes using two structural fields, a technical field and a background field, which are selected based on the characteristics of patent documents and the role of the structural fields. A multi-label classification model is also constructed to reflect that a patent document can have multiple IPCs and to classify patent documents at an IPC subclass level comprising 630 categories. The effects of the structural fields of the patent documents are examined using 564,793 registered patents in Korea. An 87.2% precision rate is obtained when mainly using the two fields. These results verify that the technical field and background field play an important role in improving the precision of IPC multi-label classification at the IPC subclass level.
Article
An automatic patent categorization system would be invaluable to individual inventors and patent attorneys, saving them time and effort by quickly identifying conflicts with existing patents. In recent years, it has become more and more common to classify all patent documents using the International Patent Classification (IPC), a complex hierarchical classification system comprised of eight sections, 128 classes, 648 subclasses, about 7200 main groups, and approximately 72,000 subgroups. So far, however, no patent categorization method has been developed that can classify patents down to the subgroup level (the bottom level of the IPC). Therefore, this paper presents a novel categorization method, the three phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy. The experimental results for the TPC algorithm, using the WIPO-alpha collection, indicate that our classification method can achieve 36.07% accuracy at the subgroup level. This is approximately a 25,764-fold improvement over a random guess.
Article
Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.
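The linear mapping described above can be learned by least squares from a small seed dictionary: stack the source vectors of the dictionary pairs into X, their translations into Z, and solve for W minimizing ||XW − Z||. A minimal NumPy sketch with synthetic data (the dimensions, seed-dictionary size, and vectors are all stand-ins, and real embeddings would only satisfy the linear relation approximately):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Hypothetical setup: source-language vectors X and their dictionary
# translations Z are related by an (unknown) linear map W_true.
W_true = rng.normal(size=(dim, dim))
X_seed = rng.normal(size=(50, dim))   # seed-dictionary source vectors
Z_seed = X_seed @ W_true              # matching target-language vectors

# Learn the mapping by least squares: argmin_W ||X W - Z||_F
W, *_ = np.linalg.lstsq(X_seed, Z_seed, rcond=None)

# A held-out word is "translated" by mapping its vector into the
# target space, then looking up the nearest target-language vector.
x_new = rng.normal(size=(1, dim))
err = np.linalg.norm(x_new @ W - x_new @ W_true)
```

With exact synthetic data the map is recovered perfectly; with real embeddings the residual error is what the paper's precision@5 figures measure indirectly.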
Article
The categorization of patent documents is a difficult task that we study how to automate most accurately. We report the results of applying a variety of machine learning algorithms for training expert systems in German-language patent classification tasks. The taxonomy employed is the International Patent Classification, a complex hierarchical classification scheme in which we make use of 115 classes and 367 subclasses. The system is designed to handle natural language input in the form of the full text of a patent application. The effect on the categorization precision of indexing either the patent claims or the patent descriptions is reported. We describe several ways of measuring the categorization success that account for the attribution of multiple classification codes to each patent document. We show how the hierarchical information inherent in the taxonomy can be used to improve automated categorization precision. Our results are compared to an earlier study of automated English-language patent categorization.
Article
Received 1 September 2005; accepted 29 May 2006; available online 26 March 2007.
The number of patent documents is currently rising rapidly worldwide, creating the need for an automatic categorization system to replace time-consuming and labor-intensive manual categorization. Because accurate patent classification is crucial for finding relevant existing patents in a given field, patent categorization is a very important and useful task. As patent documents are structured documents with characteristics that distinguish them from general documents, these unique traits should be considered in the patent categorization process. In this paper, we categorize Japanese patent documents automatically, focusing on their characteristics: patents are structured by claims, purposes, effects, embodiments of the invention, and so on. We propose a patent document categorization method that uses the k-NN (k-Nearest Neighbour) approach. In order to retrieve similar documents from a training document set, specific components that denote so-called semantic elements, such as claim, purpose, and application field, are compared instead of the whole texts. Because these specific components are identified by various user-defined tags, all of the components are first clustered into several semantic elements. Such semantically clustered structural components are the basic features of patent categorization. We achieve a 74% improvement in categorization performance over a baseline system that does not use the structural information of the patent.
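The core idea of component-wise k-NN can be sketched compactly: represent each patent as a dict of structural components, and compute similarity over one chosen component (e.g. the claims) rather than the whole text. The code below is a minimal illustration with toy two-dimensional vectors; the component names, vectors, and labels are hypothetical, not the paper's actual features.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_docs, train_labels, component, k=3):
    """k-NN majority vote using cosine similarity computed on a single
    structural component (e.g. the claims field), not the whole document."""
    q = query[component]
    q = q / np.linalg.norm(q)
    sims = [float(q @ (doc[component] / np.linalg.norm(doc[component])))
            for doc in train_docs]
    top_k = np.argsort(sims)[::-1][:k]
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

rng = np.random.default_rng(2)

def make_doc(center):
    # Each document carries several structural components; only the
    # selected one enters the similarity computation.
    return {"claims": center + 0.05 * rng.normal(size=2),
            "purpose": rng.normal(size=2)}

train_docs = ([make_doc(np.array([1.0, 0.0])) for _ in range(3)]
              + [make_doc(np.array([0.0, 1.0])) for _ in range(3)])
train_labels = ["A"] * 3 + ["B"] * 3
query = make_doc(np.array([0.95, 0.05]))
label = knn_classify(query, train_docs, train_labels, "claims")
```

Restricting the comparison to one semantic element is what lets structurally similar patents outrank merely lexically similar ones.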
Article
A solution T of the least-squares problem AT = B + E, given A and B, such that trace(E′E) is minimized and T′T = I, is presented. It is compared with a less general solution of the same problem given by Green [5]. The present solution, in contrast to Green's, is applicable to matrices A and B of less than full column rank. Some technical suggestions for the numerical computation of T and an illustrative example are given.
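The orthogonal Procrustes solution above has a well-known closed form via the SVD: writing A′B = USV′, the minimizer is T = UV′, and the SVD exists even when A and B are rank-deficient (the case the abstract emphasizes). A short NumPy sketch, with a toy recovery of a known orthogonal matrix:

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Solve argmin_T ||A T - B||_F subject to T' T = I.
    Closed form: with A'B = U S V', the minimizer is T = U V'."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Toy example: recover a known orthogonal matrix from noiseless data.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
A = rng.normal(size=(20, 5))
B = A @ Q
T = orthogonal_procrustes(A, B)
```

This solution is exactly the refinement step used by modern cross-lingual embedding alignment methods such as MUSE, which is presumably why the article is cited here.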
A survey of automated hierarchical classification of patents
  • JC Gomez
  • M-F Moens
  • G Paltoglou
  • F Loizides
  • P Hansen
BERT: pre-training of deep bidirectional transformers for language understanding
  • J Devlin
  • M W Chang
  • K Lee
  • K Toutanova
Cross-lingual ability of multilingual BERT: an empirical study
  • Z Wang
  • S Mayhew
  • D Roth