Conference Paper

PatMedia: Augmenting Patent Search with Content-Based Image Retrieval


Abstract

Recently, the intellectual property and information retrieval communities have shown increasing interest in image retrieval, which could augment current patent search practices. In this context, this article presents the PatMedia search engine, which is capable of retrieving patent images in a content-based manner. PatMedia is evaluated both with information retrieval metrics and in realistic patent search scenarios.


... The patent industry involves the management and tracking of an enormous amount of data, much of which takes the form of scientific drawings, technical diagrams and hand-sketched models. Comparing figures across this dataset and retrieving them by similarity in real time is extremely challenging [15], [29], [30]. We aim to track the spread of technical information by finding copies and modified copies of technical diagrams in patent databases and academic journals. ...
... While this approach is invariant to rigid body transformations, the size of the CDM is dependent on image resolution and the resulting processes are inefficient both computationally and memory-wise. The Adaptive Hierarchical Density Histogram (AHDH) method [22], along with the retrieval framework PATMEDIA [30], exploits both local and global content. It uses both content-based (i.e., image-based) and concept-based (text-based) retrieval and claims that joint retrieval using both text and image gives better retrieval performance. ...
Preprint
The complex problem of image retrieval for diagram images has yet to be resolved. Deep learning methods continue to excel in object detection and image classification on natural imagery. However, their application to binary imagery remains limited due to the lack of crucial features such as texture, color and intensity information. This paper presents a deep learning based method for image-based search of binary patent images that takes advantage of existing large natural image repositories for image search and of sketch-based methods. Sketches are not identical to diagrams, but they do share some characteristics; for example, both imagery types are gray scale (binary), composed of contours, and lacking in texture. We begin by using deep learning to generate sketches from natural images for image retrieval and then train a second deep learning model on the sketches. We then use our small set of manually labeled patent diagram images via transfer learning to adapt the image search from sketches of natural images to diagrams. Our experimental results show the effectiveness of deep learning with transfer learning for detecting near-identical copies in patent images and querying similar images based on content.
... With a view to providing a qualitative evaluation of the patent retrieval engine, we have uploaded around 320,000 images from about 15,000 patents from IPC G03F007/20 (relevant to lithography) extracted from the MAREC database and performed specific patent search cases. Specifically, we demonstrate two interaction modes by considering two patent search scenarios [33]. ...
Chapter
Nowadays, most patent search systems still rely upon text to provide retrieval functionalities. Recently, the intellectual property and information retrieval communities have shown great interest in patent image retrieval, which could augment the current practices of patent search. In this chapter, we present a patent image extraction and retrieval framework, which deals with patent image extraction and multimodal (textual and visual) metadata generation from patent images with a view to providing content-based search and concept-based retrieval functionalities. Patent image extraction builds upon page orientation detection and segmentation, while metadata extraction from images is based on the generation of low-level visual and textual features. The content-based retrieval functionality is based on low-level visual features, which have been devised to deal with complex black and white drawings. Extraction of concepts builds upon a supervised machine learning framework realised with Support Vector Machines and a combination of visual and textual features. We evaluate the different retrieval parts of the framework by using datasets from the footwear and lithography domains.
Article
Patent documents are important intellectual resources for protecting the interests of individuals, organizations and companies. Unlike general web documents, patent documents have a well-defined format including front page, description, claims, and figures. However, they are lengthy and rich in technical terms, so analyzing them requires enormous human effort. Hence, a new research area, called patent mining, has emerged in recent years, aiming to assist patent analysts in investigating, processing, and analyzing patent documents. Despite the recent advances in patent mining, it is still far from being well explored in research communities. To help patent analysts and interested readers obtain a big picture of patent mining, we provide a systematic summary of existing research efforts along this direction. In this survey, we first present an overview of the technical trend in patent mining. We then investigate multiple research questions related to patent documents, including patent retrieval, patent classification, and patent visualization, and provide summaries and highlights for each question by delving into the corresponding research efforts.
Article
Patent documents with sophisticated technical information are valuable for developing new technologies and products. They can be written in almost any language, leading to language barrier problems during retrieval. Traditionally, cross-language information retrieval and cross-language document matching have used text-translation-based or index-set-mapping methods. There are several challenges to the traditional methods, however, such as difficulties with natural language translation, complications owing to bilingual or multi-lingual translations (translating between two or more than two languages), and the unavailability of a parallel dual-language document set. This study offers a new and robust solution to cross-language patent document matching: the International Patent Classification (IPC) based concept bridge approach. The proposed method applies Latent Semantic Indexing to extract concepts from each set of patent documents and utilizes the IPC codes to construct a cross-language mediator that expresses patent documents in different languages. Experiments were carried out to demonstrate the performance of the proposed method. A total of 3000 English patents and 3000 Chinese patents were gathered as training documents from the United States Patent and Trademark Office and the Taiwan Intellectual Property Office, respectively. Another 30 English patents and another 30 Chinese patents were collected as query patents. Finally, evaluations using an objective measure and subjective judgement were conducted to prove the feasibility and effectiveness of our method. The results show that our method outperforms the traditional text-translation methods.
Conference Paper
This paper proposes a novel binary image descriptor, namely the Adaptive Hierarchical Density Histogram, that can be utilized for complex binary image retrieval. This descriptor exploits the distribution of the image points on a two-dimensional area. To reflect this distribution effectively, we propose an adaptive pyramidal decomposition of the image into non-overlapping rectangular regions and the extraction of the density histogram of each region. This hierarchical decomposition algorithm is based on the recursive calculation of geometric centroids. The presented technique is experimentally shown to combine efficient performance, low computational cost and scalability. Comparison with other prevailing approaches demonstrates its high potential.
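The decomposition described in this abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' descriptor: each region is split into four rectangles at the geometric centroid of the points it contains, and the feature for each sub-region is simply its share of all image points. The function names (`centroid`, `density_histogram`), the `(x0, y0, x1, y1)` region format, and the use of raw densities at every level are hypothetical choices made for illustration.

```python
def centroid(points, region):
    """Geometric centroid of the points; the region centre if there are none."""
    x0, y0, x1, y1 = region
    if not points:
        return ((x0 + x1) / 2, (y0 + y1) / 2)
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def density_histogram(points, region, levels):
    """Return sub-region densities level by level (4 values, then 16, ...).

    points : list of (x, y) coordinates of the foreground pixels
    region : (x0, y0, x1, y1) bounding rectangle of the image
    levels : number of recursive centroid splits (an illustrative parameter)
    """
    total = len(points)
    features = []
    current = [(points, region)]
    for _ in range(levels):
        next_level = []
        for pts, (x0, y0, x1, y1) in current:
            cx, cy = centroid(pts, (x0, y0, x1, y1))
            # Four non-overlapping rectangles that meet at the centroid.
            quads = [(x0, y0, cx, cy), (cx, y0, x1, cy),
                     (x0, cy, cx, y1), (cx, cy, x1, y1)]
            buckets = [[], [], [], []]
            for x, y in pts:
                index = (1 if x >= cx else 0) + (2 if y >= cy else 0)
                buckets[index].append((x, y))
            for bucket, quad in zip(buckets, quads):
                # Density = share of all image points that fall in this rectangle.
                features.append(len(bucket) / total if total else 0.0)
                next_level.append((bucket, quad))
        current = next_level
    return features


# Toy input: four foreground pixels at the corners of a 4x4 image.
toy_points = [(0, 0), (0, 3), (3, 0), (3, 3)]
toy_features = density_histogram(toy_points, (0, 0, 4, 4), levels=2)
```

Because the split point follows the centroid instead of the region midpoint, dense parts of a drawing are subdivided more finely than empty ones, which is the adaptive behaviour the abstract refers to.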
Conference Paper
A patent always contains some images along with the text. Many text-based systems have been developed to search the patent database. In this paper, we describe PATSEEK, an image-based search system for the US patent database. The objective is to let users check the similarity of a query image with the images that exist in US patents. The user can specify a set of keywords that must exist in the text of the patents whose images will be searched for similarity. PATSEEK automatically grabs images from the US patent database at the request of the user and represents them through an edge orientation autocorrelogram. L1 and L2 distance measures are used to compute the distance between the images. A recall rate of 100% for 61% of query images and an average recall rate of 32% for the rest of the images have been observed.
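As a rough illustration of the distance-based ranking and recall measurement mentioned in this abstract, the sketch below ranks images by L1 or L2 distance between feature vectors and computes recall over the top k results. The feature vectors are toy values, not edge orientation autocorrelograms, and the helper names (`rank_images`, `recall_at_k`) as well as the recall-at-k definition are assumptions made for illustration; the abstract does not state how recall was computed.

```python
import math


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_images(query, database, distance=l1_distance):
    """Return image ids ordered from most to least similar to the query."""
    return sorted(database, key=lambda image_id: distance(query, database[image_id]))


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant images that appear in the first k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for image_id in ranked_ids[:k] if image_id in relevant_ids)
    return hits / len(relevant_ids)


# Toy database: hypothetical image ids mapped to illustrative feature vectors.
toy_database = {"a": [0, 0], "b": [3, 4], "c": [1, 1]}
toy_ranking = rank_images([0, 0], toy_database, distance=l2_distance)
toy_recall = recall_at_k(toy_ranking, {"a", "b"}, k=2)
```

Passing the distance function as a parameter keeps the ranking code identical for both measures, so the two can be compared on the same query set.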
Article
In this article, we discuss the potential benefits, requirements and challenges of patent image retrieval, and we propose a framework that encompasses advanced image analysis and indexing techniques to address the need for content-based patent image search and retrieval. The proposed framework involves the application of document image pre-processing and of image feature and textual metadata extraction in order to effectively support content-based image retrieval in the patent domain. To evaluate the capabilities of our proposal, we implemented a patent image search engine. Results based on a series of interaction modes, comparison with existing systems and a quantitative evaluation of our engine provide evidence that image processing and indexing technologies are currently sufficiently mature to be integrated in real-world patent retrieval applications.
Content-Based Retrieval of Complex Binary Images
  • P Sidiropoulos
  • S Vrochidis
  • I Kompatsiaris